A Decomposable Algorithm For Mining Frequent Itemsets In A Distributed Parallel Environment

Bibliographic Details
Main Authors: Huang, Chinghan, 黃靖涵
Other Authors: Wu, Fan
Format: Others
Language: en_US
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/05877913985952152719
Description
Summary: Master's thesis === National Chung Cheng University === Department and Graduate Institute of Information Management === 100 === Knowledge discovery in databases (KDD), also called data mining, is an active topic in both academic and business research. Frequent itemset mining plays an essential role because it is the first stage of association analysis. Many current methods adopt a distributed-parallel approach to improve time efficiency, but the gains remain inadequate. The main reason is that, in previous studies, the task of discovering frequent itemsets cannot be carried out in a fully seamless and concurrent way. In this thesis, building on the distributed-parallel strategy and item-transformation technology, we devise a decomposable algorithm, named D-Mining, for mining frequent itemsets. We prove by mathematical induction that D-Mining is correct in terms of both the occurrence of itemsets and the number of itemsets. Furthermore, the experimental results demonstrate that D-Mining yields a higher return on invested computational resources than pp-tree, an earlier method that is also designed for distributed parallel environments. Under the same settings, in which every distributed local site has multiple CPUs, D-Mining is much more efficient than pp-tree. In particular, D-Mining is stable and maintains similar efficiency even under demanding conditions: a long average transaction length, a large number of items, and a small support threshold.
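
For readers unfamiliar with the problem setting, the following is a minimal sketch of what "mining frequent itemsets across distributed local sites" means. It is not D-Mining and not pp-tree: this record does not describe how either works, so the sketch uses plain brute-force counting per partition followed by a merge. The names `local_counts`, `frequent_itemsets`, `min_support_count`, and `max_size`, and the sample data, are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, max_size):
    """Count every itemset of up to max_size items within one site's partition."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicate items
        for k in range(1, max_size + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def frequent_itemsets(partitions, min_support_count, max_size=3):
    """Merge per-site counts and keep itemsets that meet the support threshold.

    Each local_counts call depends only on its own partition, so in a real
    deployment the calls could run concurrently on separate sites. Here they
    run in a simple loop (an illustrative simplification).
    """
    total = Counter()
    for partition in partitions:
        total.update(local_counts(partition, max_size))
    return {s: c for s, c in total.items() if c >= min_support_count}


# Hypothetical data: two "local sites", five transactions in total.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c"], ["a", "b", "c"]],
]
result = frequent_itemsets(partitions, min_support_count=3)
```

With this sample data, each single item and each pair appears in at least three transactions, while `("a", "b", "c")` appears in only two and is filtered out. Brute-force enumeration grows quickly with transaction length and item count, which are exactly the conditions the abstract names as demanding.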