An efficient distributed system for mining sequential patterns

Bibliographic Details
Main Authors: Yang-Chih Huang, 黃揚智
Other Authors: 張昭憲
Format: Others
Language: zh-TW
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/89380959986555298017
Description
Summary: Master's === Tamkang University === Master's Program, Department of Information Management === 94 === In this thesis, an effective distributed mining system is developed to increase the efficiency of sequential pattern mining. Two important concerns in distributed mining, load balancing and the reduction of communication overhead, are carefully examined in our work. To achieve good load balance, accurate predictions of the workload of subtasks are indispensable. In addition, the independence of subtasks should be effectively controlled to decrease the communication overhead among nodes. With these objectives in mind, we first analyze the workload prediction methods used in previous work. We find that the commonly used prediction method is not a good metric for workload. As a result, skewed task partitions occur frequently because of inaccurate predictions. We also find that the efficiency of running a sequence mining algorithm on a transaction database is strongly related to the distribution of sequential patterns in that database. To increase mining efficiency, a suitable mining algorithm should be chosen for each database rather than applying a single algorithm to all databases regardless of their features. Based on these observations, a novel distributed data mining algorithm, the Segmented Dynamic Load Balance algorithm (SDLB), is developed. Unlike previous work, SDLB divides subtask dispatch into several stages. Depending on the situation, static and dynamic load balancing methods are applied to prevent skewed task partitions and reduce communication overhead at the same time. Furthermore, SDLB introduces a new concept: nodes may adopt different basic mining algorithms for subtasks with different characteristics to improve mining efficiency. Compared with previous work, the experimental results show that SDLB effectively reduces runtime and obtains a better speed-up ratio.
This result demonstrates the potential of SDLB for mining sequential patterns in very large databases (VLDBs).
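
The staged dispatch idea in the summary can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the SDLB implementation from the thesis: the function name `staged_dispatch`, the `static_fraction` parameter, and the cost dictionaries are all hypothetical. It assigns the first part of the subtasks statically from predicted workloads, then lets idle nodes pull the remaining subtasks dynamically, so that a wrong prediction can be absorbed in the second stage.

```python
import heapq

def staged_dispatch(predicted, actual, n_nodes, static_fraction=0.5):
    """Simulate a two-stage dispatch and return each node's finish time.

    predicted / actual: dicts mapping a subtask name to its predicted / real cost.
    All names and the two-stage split are illustrative assumptions.
    """
    # Consider subtasks in order of predicted cost, largest first.
    order = sorted(predicted, key=predicted.get, reverse=True)
    n_static = int(len(order) * static_fraction)

    # Stage 1 (static): assign up front to the node with the least predicted load.
    predicted_load = [0.0] * n_nodes
    finish = [0.0] * n_nodes
    for task in order[:n_static]:
        node = min(range(n_nodes), key=predicted_load.__getitem__)
        predicted_load[node] += predicted[task]
        finish[node] += actual[task]

    # Stage 2 (dynamic): the node that becomes idle first pulls the next subtask.
    heap = [(t, node) for node, t in enumerate(finish)]
    heapq.heapify(heap)
    for task in order[n_static:]:
        t, node = heapq.heappop(heap)
        heapq.heappush(heap, (t + actual[task], node))
    for t, node in heap:
        finish[node] = t
    return finish

# Subtask "b" is badly under-predicted (predicted 3, actually 9).
predicted = {"a": 4, "b": 3, "c": 2, "d": 1}
actual = {"a": 4, "b": 9, "c": 2, "d": 1}
mixed = staged_dispatch(predicted, actual, n_nodes=2, static_fraction=0.5)
static_only = staged_dispatch(predicted, actual, n_nodes=2, static_fraction=1.0)
print(max(mixed), max(static_only))
```

In this toy input the purely static assignment trusts the wrong prediction and overloads one node, while the mixed schedule lets the other node pick up the remaining subtasks. The trade-off is that every dynamic pull costs a message to the dispatcher, which is the communication overhead the summary refers to.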