Summary: | The work presented in this thesis addresses an area of inquiry that has seen surprisingly little research within the broad and rapidly developing field of knowledge discovery. Data Mining and Data Warehousing, both of which lie under the large umbrella of knowledge discovery, are well established disciplines in their own right, yet very little cross-fertilization has taken place between them. The integration of Data Mining techniques with Data Warehousing is gaining popularity because the two disciplines complement each other in extracting knowledge from large datasets. However, the majority of approaches apply data mining as a front-end technology for mining data warehouses; relatively little progress has been made in incorporating mining techniques into the design of data warehouses. Recently, though, there has been increasing interest in adapting clustering techniques developed in the Data Mining discipline to the Data Warehousing environment. Such an adaptation is not easy from a technical viewpoint, since a resource such as a Data Warehouse generally has a large community of users, each of whom may have different and potentially conflicting data requirements, which in turn translate into different clustering requirements for the same data resource. While methods such as data clustering applied to multidimensional data have been shown to enhance the knowledge discovery process, a number of fundamental issues remain unresolved with respect to the design of the multidimensional schema that is an integral part of any Data Warehouse. These concern automated support for the selection of informative dimension and fact variables in high-dimensional and data-intensive environments, an activity that may challenge the capabilities of human designers owing to the sheer volume of data and the number of variables involved.

In this thesis, a novel methodology is proposed that assists knowledge workers in selecting a subset of dimension and fact variables from an initially large set of candidates for the discovery of interesting data cube regions. Unlike previous research in this area, the proposed approach does not rely on the availability of specialized domain knowledge; instead, it makes use of robust data reduction methods, namely Principal Component Analysis and Multiple Correspondence Analysis, to identify a small subset of numeric and nominal variables that capture the greatest degree of variation in the data and are therefore used to generate data cubes of interest. Moreover, information-theoretic measures such as Entropy and Information Gain are exploited to filter out less informative dimensions and so construct compact, useful and easily manageable schemas. In terms of data analysis, we experiment with association rule mining to compare the rules generated from the semi-automatically produced schemas with the rules obtained in the absence of such schemas.

The three case studies conducted on real-world datasets taken from the UCI Machine Learning Repository revealed that the methodology was able to capture regions of interest in data cubes that were significant from both the application and the statistical perspectives. Additionally, the knowledge discovered in the form of rules from the generated schemas was more diverse and informative, and had better prediction accuracy, than that obtained with the standard approach of mining the original data without the methodology-driven multidimensional structure imposed on it.
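As an illustrative sketch only (no code appears in the thesis summary itself), the fragment below shows one way the variable-selection step described above might look in Python: numeric fact candidates are screened with Principal Component Analysis, and nominal dimension candidates are ranked by Shannon entropy. The function names and the example file adult.csv are hypothetical, and the Multiple Correspondence Analysis step for nominal variables is omitted for brevity.

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def select_fact_candidates(df_numeric: pd.DataFrame, variance_target: float = 0.9) -> pd.Series:
    """Rank numeric columns by their loadings on the principal components
    that together explain `variance_target` of the total variance."""
    X = StandardScaler().fit_transform(df_numeric)
    pca = PCA().fit(X)
    # Number of components needed to reach the cumulative variance target.
    n_comp = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), variance_target)) + 1
    # Score each original variable by the magnitude of its loadings,
    # weighted by the variance explained by each retained component.
    loadings = np.abs(pca.components_[:n_comp])
    weights = pca.explained_variance_ratio_[:n_comp]
    scores = loadings.T @ weights
    return pd.Series(scores, index=df_numeric.columns).sort_values(ascending=False)

def rank_dimension_candidates(df_nominal: pd.DataFrame) -> pd.Series:
    """Rank nominal columns by Shannon entropy; near-constant (low-entropy)
    columns carry little information and are candidates for filtering out."""
    ent = {col: entropy(df_nominal[col].value_counts(normalize=True), base=2)
           for col in df_nominal.columns}
    return pd.Series(ent).sort_values(ascending=False)

# Hypothetical usage on a mixed numeric/nominal dataset:
# df = pd.read_csv("adult.csv")
# fact_scores = select_fact_candidates(df.select_dtypes("number"))
# dimension_scores = rank_dimension_candidates(df.select_dtypes("object"))
```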
|