Distributedly Mining Frequent Numerical Patterns in Multi-Dimensional Databases

Bibliographic Details
Main Authors: Yen-Chi Chen, 陳彥琦
Other Authors: Anthony JT Lee
Format: Others
Language: en_US
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/20247575900256404797
Description
Summary: Master's thesis === National Taiwan University === Graduate Institute of Information Management === 99 === Most previously proposed methods mine frequent patterns from symbolic databases, where numerical data may be transformed into a symbolic representation by a list of breakpoints. However, using different breakpoints to transform numerical data into a symbolic representation can result in different patterns. Thus, it is better to mine numerical patterns directly, without any transformation. Mining numerical patterns from databases is highly computation-intensive, which makes it well suited to cloud computing infrastructure. To the best of our knowledge, there is no algorithm dedicated to mining frequent numerical patterns on the cloud. Therefore, in this thesis, we propose an efficient algorithm for mining frequent numerical patterns in a multi-dimensional database on the MapReduce framework, where every transaction in the database contains multiple numerical values. The proposed algorithm is composed of three MapReduce jobs. The first job mines frequent patterns of length one (1-patterns) in parallel. The second job gathers the information about the frequent 1-patterns mined and uses it to divide the mining process into tasks of nearly equal size. The third job distributes these tasks to different worker instances, each of which recursively mines frequent patterns in a depth-first search (DFS) manner until no more frequent patterns can be found. During the mining process, we employ two effective speedup strategies to form tasks of nearly equal size and balance the workload, and an approach to divide a multi-dimensional database into independent partitions so that the mining tasks can be performed independently and in parallel. Therefore, the proposed method can efficiently mine frequent numerical patterns in a multi-dimensional database.
The experimental results show that the proposed algorithm, DNM, outperforms the modified Partition-Apriori algorithm by orders of magnitude.
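
The summary's motivating point, that different breakpoints can produce different symbolic patterns from the same numerical data, can be illustrated with a small sketch. This is not code from the thesis: the function name `to_symbols`, the sample values, and both breakpoint lists are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

def to_symbols(values, breakpoints, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Map each value to a letter for the interval it falls in.

    `breakpoints` must be sorted in ascending order. A value equal to a
    breakpoint is assigned to the interval on its right.
    """
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in values)

# Hypothetical numerical values (an assumption, not data from the thesis).
values = [1.0, 2.4, 2.6, 4.9, 5.1]

# Two equally plausible breakpoint lists (also assumptions).
coarse = to_symbols(values, [2.5, 5.0])
shifted = to_symbols(values, [2.0, 4.0])

print(coarse)   # aabbc
print(shifted)  # abbcc
```

Under the first breakpoint list the symbolic pattern "aa" occurs, while under the second it does not, even though the underlying numbers are identical. This is the kind of sensitivity to discretization that mining numerical patterns directly is meant to avoid.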