Data Mining Based on MapReduce Technology

Bibliographic Details
Main Authors: WU, I-CHUN, 吳亦鈞
Other Authors: WUU, LIH-CHYAU
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/ak4v53
id ndltd-TW-106YUNT0392035
record_format oai_dc
spelling ndltd-TW-106YUNT0392035 2019-05-16T00:44:54Z http://ndltd.ncl.edu.tw/handle/ak4v53 Data Mining Based on MapReduce Technology 基於MapReduce對於資料探勘技術之研製 WU, I-CHUN 吳亦鈞 Master's National Yunlin University of Science and Technology Department of Computer Science and Information Engineering 106 WUU, LIH-CHYAU 伍麗樵 2018 thesis 51 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Yunlin University of Science and Technology === Department of Computer Science and Information Engineering === 106 === The widespread use of the Internet makes data easy to share and disseminate, but finding important information in massive data has become a challenge. Data mining is one of the technologies for finding such information, and mining finishes faster when more computers are used. Our data mining method is therefore designed to run on the Hadoop distributed platform to shorten mining time. Before mining, each item of each transaction in the original dataset is transformed to 1 or 0: 1 means that the transaction contains the corresponding item and 0 means that it does not. This binary representation speeds up mining because most of the work reduces to simple logic operations. This work proposes two algorithms, Brute Force and Candidate Itemset, to find the frequent itemsets. Brute Force enumerates all combinations of items exhaustively, whereas Candidate Itemset is based on the Apriori method and uses the Apriori property to prune unnecessary candidate itemsets, so it does not have to evaluate every combination. In addition, a sequence mining algorithm is proposed to find sequential relationships among frequent items. The mushrooms, chess, and c10d20k [26] datasets are run on Hadoop with our method and with the methods of Emin [21] and Li [15]. The experimental results show that for the mushrooms dataset, our mining time is 48% faster than Emin when the threshold is 0.15 and 63% faster than Li when the threshold is 0.45. For the chess dataset, our mining time is 22% and 97% faster than Emin and Li, respectively, when the threshold is 0.55. For the c10d20k dataset, our mining time is 50% and 99% faster than Emin and Li, respectively, when the threshold is 0.15.
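
The binary encoding and Apriori-style pruning described in the abstract can be illustrated with a small single-machine sketch. This is not the thesis's Hadoop implementation: the function names (`encode`, `support`, `frequent_itemsets`), the use of Python integers as bitmasks, and the sample transactions are all assumptions made for illustration. It only shows how a 1/0 representation lets support counting reduce to a bitwise AND per transaction, and how the Apriori property avoids testing every combination of items.

```python
from itertools import combinations


def encode(transactions):
    """Turn each transaction into an integer bitmask (bit i = 1 if item i is present)."""
    items = sorted({item for t in transactions for item in t})
    position = {item: i for i, item in enumerate(items)}
    masks = [sum(1 << position[item] for item in set(t)) for t in transactions]
    return items, masks


def support(candidate, masks):
    """Fraction of transactions containing every item of `candidate` (one AND per row)."""
    return sum(1 for m in masks if (m & candidate) == candidate) / len(masks)


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search over bitmask-encoded transactions (illustrative only)."""
    items, masks = encode(transactions)
    n = len(items)
    # Level 1: single items that meet the threshold.
    level = {}
    for i in range(n):
        s = support(1 << i, masks)
        if s >= min_support:
            level[1 << i] = s
    found = {}
    k = 1
    while level:
        for mask, s in level.items():
            names = frozenset(items[i] for i in range(n) if (mask >> i) & 1)
            found[names] = s
        # Join step: OR two frequent k-itemsets; keep unions with exactly k + 1 items.
        candidates = {a | b for a, b in combinations(level, 2)
                      if bin(a | b).count("1") == k + 1}
        next_level = {}
        for c in candidates:
            # Prune step (Apriori property): every k-subset must itself be frequent.
            subsets = (c & ~(1 << i) for i in range(n) if (c >> i) & 1)
            if all(sub in level for sub in subsets):
                s = support(c, masks)
                if s >= min_support:
                    next_level[c] = s
        level = next_level
        k += 1
    return found


# Hypothetical sample data, not taken from the thesis datasets.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = frequent_itemsets(transactions, 0.6)
```

With these sample transactions and a threshold of 0.6, the three single items and the three pairs are frequent, while the triple {a, b, c} appears in only two of five transactions and is rejected. A brute-force variant would instead call `support` on every non-empty combination of items, which is the exhaustive work the pruning step is meant to avoid.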