A Study of Efficient Data Warehouse and Data Mining Techniques

Doctoral === Feng Chia University === Institute of Information Engineering === 94 === Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. It is currently regarded as the key element of a much more elaborate process called knowledge discovery in databases (KDD), which is closely linked to a...

Bibliographic Details
Main Authors: Ming-Chuan Hung, 洪明傳
Other Authors: Don-Lin Yang
Format: Others
Language: en_US
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/70389186760743577809
id ndltd-TW-094FCU05392052
record_format oai_dc
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral === Feng Chia University === Institute of Information Engineering === 94 === Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. It is currently regarded as the key element of a much more elaborate process called knowledge discovery in databases (KDD), which is closely linked to another important development, data warehousing. The combination of data warehousing, decision support, and data mining represents an innovative approach to information management. In this dissertation, we concentrate on this newly emerging field. We investigate three major parts of the warehousing and mining process: the first addresses data warehousing with optimal utilization of materialized views; the second proposes a novel algorithm for mining association rules using a merged-transactions approach; and the third presents an efficient algorithm for k-means clustering and proposes an efficient algorithm, named spFCM, for fast fuzzy clustering.
View materialization is an effective method to increase query efficiency in a data warehouse and improve OLAP query performance, providing the basis for integration with a data mining process. However, materializing all possible views in advance runs into the problem of insufficient storage space. Reducing query time by selecting a proper set of materialized views at a lower cost is therefore crucial for efficient data warehousing. In addition, the costs of data warehouse creation, query, and maintenance have to be taken into account when views are materialized. In this study, efficient algorithms are proposed to select a proper set of materialized views, constrained by storage and cost considerations, to improve the performance of the entire data warehousing process. A cost model for data warehouse query and maintenance is also derived so that the view selection algorithms can exploit gain and loss metrics effectively. The main contribution of the proposed approach is to dramatically improve the efficiency of the selection process for materialized views, thereby greatly reducing the overall cost of data warehouse query and maintenance.
Mining association rules among sets of items in a large database has been widely investigated. In Apriori-like algorithms, transactions are not kept in memory, so multiple database scans are required, which is computationally expensive. A different approach is proposed in this study: three data preprocessing methods group the transactions in a database, and an efficient algorithm then mines association rules from the merged transactions with much better performance. The first preprocessing method sorts transactions to classify them into groups. The other two use a simplified dynamic programming algorithm to merge transactions. Compared with other algorithms, the proposed method performs much better, with less I/O overhead. The experimental results show that, after preprocessing, only one scan of the database is required in the data mining stage to generate association rules. In addition, the proposed method is especially suitable for very large databases and can be applied in an incremental fashion.
The k-means algorithm is one of the most widely used methods to partition a dataset into groups of patterns. However, most k-means methods require expensive distance calculations to the centroids in order to achieve convergence. This study presents an efficient k-means clustering algorithm that produces clusters comparable to those of other methods at lower computational cost. In the proposed algorithm, the original dataset is partitioned into blocks; each block, called a Unit Block (UB), contains at least one pattern. The centroid of a unit block (CUB) can be located with a simple calculation. All the computed CUBs form a reduced dataset that represents the original dataset. The reduced dataset of CUBs is then used to compute the final centroids of the original dataset. Only the UBs on the boundaries of candidate clusters need to be examined to find the closest final centroid for every pattern in the UB. In this way, the time needed to calculate the final converged centroids can be reduced dramatically. Experimental results indicate that this algorithm produces clustering results comparable to those of other k-means algorithms, but at much lower computational cost.
The fuzzy c-means (FCM) algorithm, commonly used for clustering, can provide a data partition that is better and more meaningful than those of hard clustering approaches. The performance of the FCM algorithm depends on the selection of the initial cluster centers and/or the initial membership values. If good initial cluster centers are found close to the actual final cluster centers, the FCM algorithm converges very quickly and the processing time can be drastically reduced. We propose a novel algorithm based on FCM clustering, which significantly reduces the required computation time by using a simple partitioning approach. In the first phase of the proposed algorithm, called spFCM, a reduced dataset is derived from the original source dataset to estimate the initial cluster centers, which are used in the second phase to find the actual cluster centers. The performance of the spFCM algorithm is evaluated empirically on synthetic datasets that exhibit varying cluster sizes and distributions. The observations on these synthetic datasets indicate that the proposed spFCM algorithm is on average five times faster than the original FCM algorithm. Additionally, the clustering quality of the proposed spFCM algorithm is comparable to that of the FCM algorithm.
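The unit-block reduction that the abstract describes for k-means can be illustrated with a minimal sketch. This is not the dissertation's implementation: the uniform grid, the cell size, the initial centers, and the function names (unit_blocks, block_centroids, weighted_kmeans) are all assumptions made for illustration, and the boundary-only examination of UBs is replaced here by a plain nearest-centroid pass over every pattern.

```python
from collections import defaultdict
from math import floor


def unit_blocks(points, cell):
    """Group points into grid cells of side `cell` (assumed partitioning).

    Only non-empty cells are created, so every block holds at least one pattern.
    """
    blocks = defaultdict(list)
    for p in points:
        key = tuple(floor(x / cell) for x in p)
        blocks[key].append(p)
    return blocks


def block_centroids(blocks):
    """Return one (centroid, weight) pair per block; weight is the point count."""
    cubs = []
    for pts in blocks.values():
        n = len(pts)
        dim = len(pts[0])
        centroid = tuple(sum(p[d] for p in pts) / n for d in range(dim))
        cubs.append((centroid, n))
    return cubs


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def weighted_kmeans(cubs, centers, max_iter=100):
    """Lloyd iterations on the reduced dataset, weighting each CUB by its count."""
    centers = [tuple(c) for c in centers]
    dim = len(centers[0])
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in centers]
        weights = [0] * len(centers)
        for c, w in cubs:
            i = nearest(c, centers)
            weights[i] += w
            for d in range(dim):
                sums[i][d] += w * c[d]
        # A center that attracts no CUB keeps its previous position.
        new_centers = [
            tuple(s / weights[i] for s in sums[i]) if weights[i] else centers[i]
            for i in range(len(centers))
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Illustrative data: two well-separated groups of three points each.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.1, 5.2), (5.3, 5.0), (5.2, 5.4)]
blocks = unit_blocks(points, cell=1.0)
cubs = block_centroids(blocks)
centers = weighted_kmeans(cubs, centers=[(0.0, 0.0), (6.0, 6.0)])
# Simplified final step: assign every pattern, not only those in boundary UBs.
labels = [nearest(p, centers) for p in points]
```

Weighting each CUB by the number of patterns in its block keeps the centers computed on the reduced dataset close to those that the full dataset would give, while the distance calculations per iteration scale with the number of non-empty blocks instead of the number of patterns.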
author2 Don-Lin Yang
author_facet Don-Lin Yang
Ming-Chuan Hung
洪明傳
author Ming-Chuan Hung
洪明傳
author_sort Ming-Chuan Hung
title A Study of Efficient Data Warehouse and Data Mining Techniques
publishDate 2006
url http://ndltd.ncl.edu.tw/handle/70389186760743577809
spelling ndltd-TW-094FCU05392052 2015-12-11T04:04:18Z http://ndltd.ncl.edu.tw/handle/70389186760743577809 A Study of Efficient Data Warehouse and Data Mining Techniques 有效率的資料倉儲與資料挖掘技術之研究 Ming-Chuan Hung 洪明傳 Doctoral Feng Chia University Institute of Information Engineering 94 Don-Lin Yang 楊東麟 2006 thesis 123 en_US