Data Mining Based on MapReduce Technology
Master's thesis === National Yunlin University of Science and Technology === Department of Computer Science and Information Engineering === 106 === The widespread use of the Internet makes data easy to share and disseminate, but finding important information in massive data has become a challenge. Data mining is one technology for finding such information, and it will get faster mining resu...
Main Authors: | WU, I-CHUN 吳亦鈞 |
---|---|
Other Authors: | WUU, LIH-CHYAU 伍麗樵 |
Format: | Others |
Language: | zh-TW |
Published: | 2018 |
Online Access: | http://ndltd.ncl.edu.tw/handle/ak4v53 |
id |
ndltd-TW-106YUNT0392035 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-106YUNT0392035 2019-05-16T00:44:54Z http://ndltd.ncl.edu.tw/handle/ak4v53 Data Mining Based on MapReduce Technology 基於MapReduce對於資料探勘技術之研製 WU, I-CHUN 吳亦鈞 Master's thesis, National Yunlin University of Science and Technology, Department of Computer Science and Information Engineering, 106 WUU, LIH-CHYAU 伍麗樵 2018 thesis 51 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Yunlin University of Science and Technology === Department of Computer Science and Information Engineering === 106 === The widespread use of the Internet makes data easy to share and disseminate, but finding important information in massive data has become a challenge. Data mining is one technology for finding such information, and it produces results faster when more computers are used. Our data mining method is therefore designed to run on the Hadoop distributed platform to reduce mining time. Before mining, each item of each transaction in the original dataset is transformed to 1 or 0: 1 means that the transaction contains the corresponding item, and 0 means that it does not. This binary representation speeds up mining because most of the work reduces to simple logic operations.
This paper proposes two algorithms, Brute Force and Candidate Itemset, to find the frequent itemsets. Brute Force enumerates item combinations exhaustively, whereas Candidate Itemset is based on the Apriori method and uses the Apriori property to prune unnecessary candidate itemsets, so it does not have to evaluate every combination of items as Brute Force does. In addition, a sequence mining algorithm is proposed to find sequential relationships among frequent items. The mushrooms, chess, and c10d20k datasets [26] were run on Hadoop using our method and the methods of Emin [21] and Li [15]. The experimental results show that, for the mushrooms dataset, our mining time is 48% faster than Emin's at a threshold of 0.15 and 63% faster than Li's at a threshold of 0.45. For the chess dataset, our mining time is 22% and 97% faster than Emin's and Li's, respectively, at a threshold of 0.55. For the c10d20k dataset, our mining time is 50% and 99% faster than Emin's and Li's, respectively, at a threshold of 0.15.
|
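The binary encoding and Apriori-style pruning described in the abstract can be illustrated with a minimal single-machine sketch. This is not the thesis's Hadoop implementation, which this record does not show. The function names (`encode_columns`, `support_count`, `frequent_itemsets`) and the sample transactions are hypothetical. Each item is assumed to be stored as an integer bitmask with one bit per transaction, so counting the support of an itemset reduces to a bitwise AND followed by a bit count.

```python
from itertools import combinations


def encode_columns(transactions, items):
    """Map each item to an int bitmask; bit t is 1 if transaction t contains the item."""
    columns = {item: 0 for item in items}
    for t, transaction in enumerate(transactions):
        for item in transaction:
            columns[item] |= 1 << t
    return columns


def support_count(itemset, columns, n_transactions):
    """Count transactions containing every item by AND-ing the item bitmasks."""
    mask = (1 << n_transactions) - 1  # start with all transactions selected
    for item in itemset:
        mask &= columns[item]
    return bin(mask).count("1")


def frequent_itemsets(transactions, min_support):
    """Level-wise search with Apriori pruning; min_support is a fraction in [0, 1]."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    columns = encode_columns(transactions, items)
    result = {}
    level = [(item,) for item in items]  # candidate 1-itemsets
    while level:
        frequent = []
        for candidate in level:
            count = support_count(candidate, columns, n)
            if count / n >= min_support:
                result[candidate] = count
                frequent.append(candidate)
        frequent_set = set(frequent)
        k = len(level[0]) + 1  # size of the next candidates
        candidates = set()
        for a, b in combinations(frequent, 2):
            union = tuple(sorted(set(a) | set(b)))
            # Apriori property: keep a candidate only if all of its
            # (k-1)-item subsets are themselves frequent.
            if len(union) == k and all(
                sub in frequent_set for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        level = sorted(candidates)
    return result


# Hypothetical sample data, for illustration only.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, count in sorted(frequent_itemsets(sample, 0.5).items()):
    print(itemset, count)
```

A brute-force variant would call `support_count` on every combination of items instead of only on the pruned candidates. The pruning step is what lets the level-wise search skip supersets of itemsets already known to be infrequent.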
author2 |
WUU, LIH-CHYAU |
author_facet |
WUU, LIH-CHYAU WU, I-CHUN 吳亦鈞 |
author |
WU, I-CHUN 吳亦鈞 |
title |
Data Mining Based on MapReduce Technology |
publishDate |
2018 |
url |
http://ndltd.ncl.edu.tw/handle/ak4v53 |