Summary: | The objective of the research presented in this thesis is to investigate and evaluate a series of techniques for identifying banded patterns in zero-one data, starting with 2D data and then increasing the number of dimensions considered to 3D and then ND. To this end the term Banded Pattern Mining (BPM) has been coined to describe the process of extracting hidden banded patterns from data. BPM has wide applicability in areas where the domain of interest can be represented in the form of a matrix holding 1s and 0s. Five BPM algorithms (and a number of variations) are proposed for finding bandings in zero-one data: (i) 2D-BPM, (ii) Approximate 3D-BPM, (iii) Exact 3D-BPM, (iv) Approximate ND (AND) BPM and (v) Exact ND (END) BPM. This thesis describes and discusses each of these algorithms in detail. The main challenges of BPM are: (i) how best to identify bandings in 2D data sets without the need to consider large numbers of permutations, (ii) how to address situations where multiple dots may be located at individual locations in an ND zero-one (dot) data space of interest and (iii) how best to identify bandings in large ND data sets that cannot be held in primary storage. To address the first challenge a banding score mechanism is proposed that avoids the need to consider large numbers of permutations; this has been incorporated into the 2D-BPM algorithm. To address the second challenge a 'multiple dot' mechanism is proposed, used in both the approximate and exact BPM algorithms, that allows for some cells in the data space of interest holding more than one dot. To address the third challenge, sampling and segmentation techniques are proposed to identify bandings in large ND data sets. Full evaluations of each of the BPM algorithms are presented. 
For evaluation purposes the data sets used were categorised as follows: (i) randomly generated synthetic data sets, (ii) UCI data sets and (iii) a specific application, the Great Britain (GB) Cattle Tracing System (CTS) database, from which 5D binary valued data sets were extracted and used. In the latter case the dimensions were: (i) records (number of animal movements), (ii) attributes, (iii) sender easting (x-coordinate of the holding area), (iv) sender northing (y-coordinate of the holding area) and (v) time (month). More specifically, data sets from 2003 to 2006 across four specific counties in GB were used: Aberdeenshire, Cornwall, Lancashire and Norfolk. Other banding application domains that could have been considered include: (i) network analysis, (ii) co-occurrence analysis, (iii) VLSI chip design and (iv) graph drawing. An independent metric, Average Band Width (ABW), was proposed and used to measure the quality of bandings and to provide a mechanism for comparing the BPM algorithms. The reported evaluation indicates that the use of approximate BPM (rather than exact BPM) is more efficient in terms of run-time, whilst the use of exact BPM provided promising results in terms of the quality of the bandings produced. The reported evaluation also indicates that a sound foundation has been established for future work on high performance computing variations of the proposed BPM algorithms.
|