A Study of Efficient Data Warehouse and Data Mining Techniques
Ph.D. === Feng Chia University === Graduate Institute of Information Engineering === 94 === Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. It is currently regarded as the key element of a much more elaborate process called knowledge discovery in databases (KDD), which is closely linked to a...
Main Authors: | Ming-Chuan Hung, 洪明傳 |
---|---|
Other Authors: | Don-Lin Yang |
Format: | Others |
Language: | en_US |
Published: | 2006 |
Online Access: | http://ndltd.ncl.edu.tw/handle/70389186760743577809 |
id |
ndltd-TW-094FCU05392052 |
---|---|
record_format |
oai_dc |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Ph.D. === Feng Chia University === Graduate Institute of Information Engineering === 94 === Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. It is currently regarded as the key element of a more elaborate process called knowledge discovery in databases (KDD), which is closely linked to another important development, data warehousing. The combination of data warehousing, decision support, and data mining represents an innovative approach to information management. This dissertation concentrates on research in this newly emerging field. We investigate three major portions of the mining and warehousing process: the first presents data warehousing with optimal utilization of materialized views; the second proposes a novel algorithm for mining association rules using a merged-transactions approach; and the third presents an efficient algorithm for k-means clustering together with an efficient algorithm, named spFCM, for fast fuzzy clustering.
View materialization is an effective method to increase query efficiency in a data warehouse and improve OLAP query performance, providing the basis for integration with a data mining process. However, storage space becomes insufficient if all possible views are materialized in advance. Reducing query time by selecting a proper set of materialized views at a lower cost is therefore crucial for efficient data warehousing. In addition, the costs of data warehouse creation, query, and maintenance have to be taken into account when views are materialized. In this study, efficient algorithms are proposed to select a proper set of materialized views, constrained by storage and cost considerations, to improve the performance of the entire data warehousing process. A cost model for data warehouse query and maintenance is also derived so that the view selection algorithms can exploit gain and loss metrics effectively. The main contribution of the proposed approach is to dramatically improve the efficiency of the selection process for materialized views, thereby greatly reducing the overall cost of data warehouse query and maintenance.
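The selection problem described above can be illustrated with a minimal sketch. It is not the dissertation's algorithm: it assumes each candidate view has a fixed size, a fixed query gain, and a fixed maintenance cost (all hypothetical numbers), and it ranks views greedily by net benefit per unit of storage. A real cost model would also account for how the gain of one view depends on the views already selected.

```python
def select_views(candidates, space_budget):
    """Greedy sketch: pick views by (gain - maintenance) per unit of storage.

    candidates: dict mapping a view name to (size, query_gain, maintenance_cost).
    Sizes are assumed to be positive. All figures are illustrative assumptions.
    """
    chosen, used = [], 0
    ranked = sorted(
        candidates.items(),
        key=lambda kv: (kv[1][1] - kv[1][2]) / kv[1][0],
        reverse=True,
    )
    for name, (size, gain, maint) in ranked:
        if gain - maint <= 0:
            break  # every remaining view costs more to maintain than it saves
        if used + size <= space_budget:
            chosen.append(name)
            used += size
    return chosen, used


# Hypothetical candidate views: (size, query gain, maintenance cost)
views = {
    "by_day": (50, 400, 100),
    "by_month": (10, 300, 20),
    "by_region": (30, 90, 120),
    "by_product": (40, 200, 40),
}
selected, space_used = select_views(views, space_budget=60)
```

The greedy ratio is only one plausible heuristic; it is shown because it makes the storage constraint and the gain/loss trade-off explicit in a few lines.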
Mining association rules among sets of items in a large database has been widely investigated. In Apriori-like algorithms, transactions are not stored in memory and multiple database scans are required, which makes these algorithms computationally expensive. A different approach is proposed in this study: three data preprocessing methods group the transactions in a database, and an efficient algorithm then mines association rules from the merged transactions with much better performance. The first preprocessing method sorts transactions to classify the groups. The other two methods use a simplified dynamic programming algorithm to merge transactions. Compared with other algorithms, the proposed method performs much better with less I/O overhead. The experimental results show that, after preprocessing, only one scan of the database is required in the data mining stage to generate association rules. In addition, the proposed method is especially suitable for very large databases and can be applied in an incremental fashion.
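A small sketch can make the idea of mining from merged transactions concrete. The abstract does not specify how transactions are merged, so this example assumes the simplest case: identical transactions are collapsed into one group with a count, and itemset supports are then accumulated in a single pass over the groups. The function names and the brute-force enumeration of itemsets are illustrative choices, not the proposed algorithm.

```python
from collections import Counter
from itertools import combinations


def merge_transactions(transactions):
    """Collapse identical transactions into {frozenset(items): count} (assumed merge rule)."""
    return Counter(frozenset(t) for t in transactions)


def frequent_itemsets(merged, min_support, max_size=2):
    """Accumulate itemset supports in one pass over the merged groups."""
    support = Counter()
    for items, count in merged.items():
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(items), k):
                support[combo] += count  # one update covers `count` transactions
    return {s: c for s, c in support.items() if c >= min_support}


transactions = [["a", "b"], ["b", "a"], ["a", "c"], ["a", "b", "c"], ["b"]]
merged = merge_transactions(transactions)
frequent = frequent_itemsets(merged, min_support=3)
```

Because each merged group carries a count, a group that stands for many identical transactions is processed once, which is the source of the I/O saving in this simplified setting.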
The k-means algorithm is one of the most widely used methods to partition a dataset into groups of patterns. However, most k-means methods require expensive distance calculations between patterns and centroids to achieve convergence. This study presents an efficient algorithm to implement k-means clustering that produces clusters comparable to those of other methods at lower computational cost. In the proposed algorithm, the original dataset is partitioned into blocks; each block unit, called a Unit Block (UB), contains at least one pattern. The centroid of a unit block (CUB) can be located by a simple calculation. All the computed CUBs form a reduced dataset that represents the original dataset. The reduced dataset of CUBs is then used to compute the final centroids of the original dataset. Only the UBs on the boundary of candidate clusters need to be examined to find the closest final centroid for every pattern in the UB. In this way, the time for calculating the final converged centroids can be reduced dramatically. Experimental results indicate that this algorithm produces clustering results comparable to those of other k-means algorithms, but at much lower computational cost.
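The unit-block idea can be sketched as follows, under stated assumptions: the data are 2-D, unit blocks are square grid cells of side `cell`, and each CUB is weighted by the number of patterns in its block. The function names are hypothetical, and the boundary-block refinement mentioned above is omitted, so this is only an approximation of the described method.

```python
import math
from collections import defaultdict


def unit_block_centroids(points, cell):
    """Group 2-D points into square grid cells and return (centroid, count) per non-empty cell."""
    blocks = defaultdict(list)
    for x, y in points:
        blocks[(math.floor(x / cell), math.floor(y / cell))].append((x, y))
    cubs = []
    for members in blocks.values():
        n = len(members)
        cx = sum(p[0] for p in members) / n
        cy = sum(p[1] for p in members) / n
        cubs.append(((cx, cy), n))
    return cubs


def weighted_kmeans(cubs, centers, iters=50):
    """Lloyd iterations on block centroids, each weighted by its pattern count."""
    centers = list(centers)
    for _ in range(iters):
        sums = [[0.0, 0.0, 0] for _ in centers]
        for (cx, cy), n in cubs:
            j = min(
                range(len(centers)),
                key=lambda i: (cx - centers[i][0]) ** 2 + (cy - centers[i][1]) ** 2,
            )
            sums[j][0] += cx * n
            sums[j][1] += cy * n
            sums[j][2] += n
        new = [
            (sx / n, sy / n) if n else centers[i]  # keep an empty cluster's center
            for i, (sx, sy, n) in enumerate(sums)
        ]
        if new == centers:
            break
        centers = new
    return centers


points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.1, 5.2), (5.3, 5.1), (5.2, 5.4)]
cubs = unit_block_centroids(points, cell=1.0)
centers = weighted_kmeans(cubs, centers=[(0.0, 0.0), (6.0, 6.0)])
```

The distance computations in each iteration now scale with the number of non-empty blocks rather than the number of patterns, which is where the saving comes from when blocks hold many patterns.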
The fuzzy c-means (FCM) algorithm, commonly used for clustering, can provide a data partition that is better and more meaningful than hard clustering approaches. The performance of the FCM algorithm depends on the selection of the initial cluster centers and/or the initial membership values. If a good initial cluster center is found close to the actual final cluster center, the FCM algorithm converges very quickly and the processing time can be drastically reduced. We propose a novel algorithm based on FCM clustering that significantly reduces the required computation time by using a simple partitioning approach. In the first phase of the proposed algorithm, called spFCM, a reduced dataset is derived from the original source dataset to estimate the initial cluster centers, which are used in the second phase to find the actual cluster centers. The performance of the spFCM algorithm is evaluated empirically using synthetic datasets that exhibit varying cluster sizes and distributions. The results on these synthetic datasets indicate that the proposed spFCM algorithm is on average five times faster than the original FCM algorithm. Additionally, the clustering quality of the proposed spFCM algorithm is comparable to that of the FCM algorithm.
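The two-phase structure can be sketched in a few lines. The FCM update rules below are the standard ones; everything else is an assumption made for illustration: the reduced dataset is taken to be the means of grid cells of side `cell`, and the names `reduce_by_blocks` and `two_phase_fcm` are hypothetical. The sketch shows the structure of the approach and makes no claim about the reported speed-up.

```python
import math
from collections import defaultdict


def fcm(points, centers, m=2.0, iters=100, tol=1e-6):
    """Standard fuzzy c-means updates; returns (centers, iterations used)."""
    centers = [tuple(c) for c in centers]
    dim = len(points[0])
    for it in range(1, iters + 1):
        num = [[0.0] * dim for _ in centers]
        den = [0.0] * len(centers)
        for p in points:
            # squared distances, clipped to avoid division by zero
            d2 = [max(sum((a - b) ** 2 for a, b in zip(p, c)), 1e-12) for c in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            for j, w in enumerate(inv):
                u = (w / total) ** m  # membership raised to the fuzzifier
                den[j] += u
                for k in range(dim):
                    num[j][k] += u * p[k]
        new = [tuple(v / den[j] for v in num[j]) for j in range(len(centers))]
        shift = max(math.dist(a, b) for a, b in zip(new, centers))
        centers = new
        if shift < tol:
            break
    return centers, it


def reduce_by_blocks(points, cell):
    """Assumed reduction: replace the points in each grid cell by their mean."""
    blocks = defaultdict(list)
    for p in points:
        blocks[tuple(math.floor(x / cell) for x in p)].append(p)
    return [tuple(sum(c) / len(ms) for c in zip(*ms)) for ms in blocks.values()]


def two_phase_fcm(points, rough_centers, cell):
    """Phase 1 estimates centers on the reduced data; phase 2 refines them on all data."""
    estimated, _ = fcm(reduce_by_blocks(points, cell), rough_centers)
    return fcm(points, estimated)


points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
          (5.0, 5.0), (5.2, 5.1), (5.1, 5.3), (5.3, 5.2)]
rough = [(1.0, 1.0), (4.0, 4.0)]
plain_centers, plain_iters = fcm(points, rough)
fast_centers, fast_iters = two_phase_fcm(points, rough, cell=1.0)
```

Comparing `plain_iters` with `fast_iters` on a given dataset shows how much the second phase benefits from the phase-one estimate; the cost of phase one itself depends on how strongly the reduction shrinks the data.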
|
author2 |
Don-Lin Yang |
author_facet |
Don-Lin Yang Ming-Chuan Hung 洪明傳 |
author |
Ming-Chuan Hung 洪明傳 |
spellingShingle |
Ming-Chuan Hung 洪明傳 A Study of Efficient Data Warehouse and Data Mining Techniques |
author_sort |
Ming-Chuan Hung |
title |
A Study of Efficient Data Warehouse and Data Mining Techniques |
title_short |
A Study of Efficient Data Warehouse and Data Mining Techniques |
title_full |
A Study of Efficient Data Warehouse and Data Mining Techniques |
title_fullStr |
A Study of Efficient Data Warehouse and Data Mining Techniques |
title_full_unstemmed |
A Study of Efficient Data Warehouse and Data Mining Techniques |
title_sort |
study of efficient data warehouse and data mining techniques |
publishDate |
2006 |
url |
http://ndltd.ncl.edu.tw/handle/70389186760743577809 |
work_keys_str_mv |
AT mingchuanhung astudyofefficientdatawarehouseanddataminingtechniques AT hóngmíngchuán astudyofefficientdatawarehouseanddataminingtechniques AT mingchuanhung yǒuxiàolǜdezīliàocāngchǔyǔzīliàowājuéjìshùzhīyánjiū AT hóngmíngchuán yǒuxiàolǜdezīliàocāngchǔyǔzīliàowājuéjìshùzhīyánjiū AT mingchuanhung studyofefficientdatawarehouseanddataminingtechniques AT hóngmíngchuán studyofefficientdatawarehouseanddataminingtechniques |
_version_ |
1718147589025562624 |
spelling |
ndltd-TW-094FCU05392052 2015-12-11T04:04:18Z http://ndltd.ncl.edu.tw/handle/70389186760743577809 A Study of Efficient Data Warehouse and Data Mining Techniques 有效率的資料倉儲與資料挖掘技術之研究 Ming-Chuan Hung 洪明傳 博士 逢甲大學 資訊工程所 94 Don-Lin Yang 楊東麟 2006 學位論文 ; thesis 123 en_US |