Summary: | Master's thesis === I-Shou University === Department of Information Engineering === 90 === Data mining is the process of extracting desirable knowledge or interesting patterns from existing databases for specific purposes. Among the proposed techniques, finding association rules or sequential patterns from transaction databases is the most common in data mining. In the past, many algorithms for mining association rules or sequential patterns from transactions have been proposed, most of which are executed in a level-wise manner. In this thesis, we propose novel mining algorithms to improve the efficiency of finding large itemsets and sequential patterns.
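As background, the level-wise process mentioned above can be sketched as a standard Apriori-style loop, which needs one full database scan per level; this is a minimal illustration of the baseline being improved upon, not the thesis's own algorithm:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise (Apriori-style) mining of large itemsets.

    One full database scan is needed per level; reducing this scan
    count is the motivation for the proposed approach.
    """
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s for s, c in counts.items() if c >= min_support}
    all_large = set(large)
    k = 2
    while large:
        # Join level-(k-1) large itemsets to form level-k candidates.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Prune any candidate with a non-large (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in large
                             for s in combinations(c, k - 1))}
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {c for c, n in counts.items() if n >= min_support}
        all_large |= large
        k += 1
    return all_large
```

With p levels of itemsets, this loop scans the database p times; the proposed algorithm instead covers any p levels with two scans.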
In the first part of this thesis, we propose a novel mining algorithm to improve the efficiency of finding large itemsets for association rules. The proposed algorithm is based on Denwattana and Getta's prediction concept and considers the data dependency in the given transactions. It aims at efficiently finding any p levels of large itemsets by scanning the database only twice, except for the first level. A new and reasonable estimation method is proposed to flexibly predict promising and non-promising candidate itemsets.
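The abstract does not give the actual estimation formula, so the prediction step can only be illustrated under an assumption. The hypothetical sketch below estimates each candidate's support from single-item supports under an item-independence assumption; the function name and the estimator are illustrative, not the thesis's method:

```python
def predict_promising(item_support, n_transactions, candidates, min_support):
    """Hypothetical illustration of the prediction step.

    Estimates each candidate itemset's support by multiplying the
    relative supports of its items (an independence assumption), then
    classifies candidates as promising or non-promising against the
    minimum support. The thesis's actual estimation method differs.
    """
    promising, non_promising = [], []
    for cand in candidates:
        est = float(n_transactions)
        for item in cand:
            est *= item_support[item] / n_transactions  # independence assumption
        (promising if est >= min_support else non_promising).append(cand)
    return promising, non_promising
```

Only the candidates predicted as promising would then be counted in the first scan, with the remaining ones verified in the second scan.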
In addition to mining association rules, mining sequential patterns is also very important to real applications, and is even more difficult than mining association rules. In the second part of this thesis, we thus extend our first approach to efficiently tackle the problem of mining sequential patterns. The proposed approach can be roughly divided into two phases. In the first phase, any p levels of large itemsets are found by scanning the database twice. The large itemsets are then used in the second phase as the large 1-sequences. Any p levels of large sequences are then found by further scanning the database twice. The approach is thus expected to provide a flexible and efficient way of finding sequential patterns from large databases.
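For concreteness, counting the support of a sequential pattern amounts to checking how many data sequences contain the pattern, where a pattern is an ordered list of itemsets. A minimal sketch, using the common containment definition (each pattern element is a subset of some later itemset of the data sequence, in order) as an assumption:

```python
def contains(data_seq, pattern):
    """Check whether a data sequence (a list of itemsets) contains a
    sequential pattern: each pattern itemset must be a subset of some
    itemset of the data sequence, in order."""
    i = 0
    for itemset in data_seq:
        if i < len(pattern) and set(pattern[i]) <= set(itemset):
            i += 1
    return i == len(pattern)

def sequence_support(database, pattern):
    """Number of data sequences in the database containing the pattern."""
    return sum(contains(seq, pattern) for seq in database)
```

A pattern is a large sequence when this count reaches the minimum support; in the second phase, the prediction idea is applied to such sequence candidates level by level.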
Experimental results show that the proposed approach for finding association rules is more efficient than the Apriori algorithm when the minimum support value is not set too high. This is because when the minimum support value is quite high, the number of large itemsets becomes very small, and the time saved by pruning candidate itemsets in the proposed algorithm does not compensate for the additional overhead. The proposed algorithm is thus best suited to low or medium minimum support values.