Summary: | Master's === National Chung Hsing University === Department of Electrical Engineering === 101 === Sequence data are ubiquitous in daily life; examples include animals’ seasonal migrations, DNA and protein sequences, and Web browsing sequences. Sequence pattern mining aims to discover the distinctive, important, and representative features hidden in sequence data, and it attracts particular attention in bioinformatics and spatio-temporal trajectory data mining. Sequence data are inherently uncertain, and the uncertainty may stem from many causes, such as limitations of the measuring technology, sampling error, and privacy preservation. In this thesis, we focus on mining uncertain sequences to discover hidden patterns by using the Probabilistic Suffix Tree (PST). The PST is an implementation of the Variable-length Markov Model (VMM) that is widely used for sequence pattern mining in many domains. However, the traditional PST building algorithm is designed for certain (deterministic) data and cannot be applied to uncertain sequences. In addition, it is a centralized algorithm and therefore cannot handle huge amounts of accumulated uncertain data. We therefore propose a distributed and parallel algorithm on the Hadoop platform to fully utilize the computing power of cloud computing for uncertain sequence mining.
In this thesis, we propose two distributed and parallel PST building algorithms on the Hadoop platform, named CloudPST+ and CloudPST+_OneScan, to speed up the learning process. Specifically, CloudPST+ follows the Map/Reduce framework and builds a PST iteratively, level by level, to avoid learning excessive patterns and to trade this saving off against the overhead of distributed computing. CloudPST+_OneScan extends CloudPST+ with a new data structure that stores the intermediate statistics, so that the algorithm scans the entire sequence data only once in each iteration.
To evaluate the performance of CloudPST+ and CloudPST+_OneScan, we implement a naïve approach derived from the well-known WordCount example of Hadoop/MapReduce and conduct several experiments on a synthetic dataset regenerated from a real trajectory dataset. According to our experimental results, CloudPST+ and CloudPST+_OneScan significantly outperform the naïve approach. In addition, at the cost of extra memory, CloudPST+_OneScan shows better efficiency, scalability, and stability than CloudPST+.
|
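The contrast between level-by-level building and a WordCount-style baseline can be illustrated with a minimal single-process sketch. It is not the thesis's CloudPST+ implementation and does not use Hadoop. The abstract does not specify the uncertainty model, so the sketch assumes each position of a sequence is an independent probability distribution over single-character symbols, and it assumes a pattern is kept when its expected count reaches a threshold. All function names and the `min_support` parameter are hypothetical.

```python
from collections import defaultdict
from itertools import product

# ASSUMPTION (not from the abstract): an uncertain sequence is a list of
# positions, each an independent distribution over single-character symbols,
# e.g. {"a": 0.5, "b": 0.5}.

def expected_counts_for(sequence, candidates):
    """Map step (hypothetical): emit (pattern, probability) per matching window."""
    for pattern in candidates:
        for start in range(len(sequence) - len(pattern) + 1):
            prob = 1.0
            for offset, symbol in enumerate(pattern):
                prob *= sequence[start + offset].get(symbol, 0.0)
            if prob > 0.0:
                yield pattern, prob

def sum_by_pattern(pairs):
    """Reduce step: sum the emitted probabilities into expected counts."""
    totals = defaultdict(float)
    for pattern, prob in pairs:
        totals[pattern] += prob
    return dict(totals)

def build_levelwise(sequences, alphabet, max_depth, min_support):
    """Count one pattern length per iteration, extending only kept patterns."""
    kept = {}
    candidates = list(alphabet)
    for _ in range(max_depth):
        if not candidates:
            break
        pairs = [pair for seq in sequences
                 for pair in expected_counts_for(seq, candidates)]
        survivors = {p: c for p, c in sum_by_pattern(pairs).items()
                     if c >= min_support}
        kept.update(survivors)
        # A PST context grows backwards, so children prepend one symbol.
        candidates = [s + p for p in survivors for s in alphabet]
    return kept

def naive_count(sequences, max_len):
    """WordCount-style baseline: count every possible pattern up to max_len."""
    pairs = []
    for seq in sequences:
        for start in range(len(seq)):
            for k in range(1, min(max_len, len(seq) - start) + 1):
                window = seq[start:start + k]
                # Enumerate every symbol combination the window could spell.
                for combo in product(*(pos.items() for pos in window)):
                    prob = 1.0
                    for _, p in combo:
                        prob *= p
                    pairs.append(("".join(s for s, _ in combo), prob))
    return sum_by_pattern(pairs)

# Toy data: one sequence of three positions; the middle one is uncertain.
data = [[{"a": 1.0}, {"a": 0.5, "b": 0.5}, {"a": 1.0}]]
print(build_levelwise(data, "ab", max_depth=3, min_support=1.0))
print(naive_count(data, max_len=2))
```

On the toy data, the baseline counts every pattern up to the length limit, while the level-wise version counts only extensions of patterns that met the threshold at the previous level. This is the sense in which level-by-level building avoids learning excessive patterns, at the price of one pass over the data per level.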