Data Mining Based on MapReduce Technology

Master's === National Yunlin University of Science and Technology === Department of Computer Science and Information Engineering === 106 === The widespread use of the Internet makes data easy to share and disseminate, but finding the important information in massive data has become a problem. Data mining is one technology for finding such information, and it will get faster mining resu...

Bibliographic Details
Main Authors:	WU, I-CHUN, 吳亦鈞
Other Authors:	WUU, LIH-CHYAU
Format:	Others
Language:	zh-TW
Published:	2018
Online Access:	http://ndltd.ncl.edu.tw/handle/ak4v53

id	ndltd-TW-106YUNT0392035
record_format	oai_dc
spelling	ndltd-TW-106YUNT0392035 2019-05-16T00:44:54Z http://ndltd.ncl.edu.tw/handle/ak4v53 Data Mining Based on MapReduce Technology 基於MapReduce對於資料探勘技術之研製 WU, I-CHUN 吳亦鈞 Master's, National Yunlin University of Science and Technology, Department of Computer Science and Information Engineering, 106 WUU, LIH-CHYAU 伍麗樵 2018 Degree thesis ; thesis 51 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Yunlin University of Science and Technology === Department of Computer Science and Information Engineering === 106 === The widespread use of the Internet makes data easy to share and disseminate, but finding the important information in massive data has become a problem. Data mining is one technology for finding such information, and mining finishes faster when more computers are used. Our data mining method is therefore designed to run on the Hadoop distributed platform to shorten the mining time. Before mining, each item of each transaction in the original dataset is transformed to 1 or 0: 1 means that the transaction contains the corresponding item, and 0 means that it does not. Transforming items into binary notation speeds up mining because most of the work then consists of simple logical operations. This paper proposes two algorithms, Brute Force and Candidate Itemset, to find the frequent itemsets. Brute Force uses the exhaustive method and examines all combinations of items; Candidate Itemset is based on the Apriori method and uses the Apriori property to prune unnecessary candidate itemsets, so it does not have to examine every combination. In addition, a sequence mining algorithm is proposed to find sequential relationships among frequent items. The mushrooms, chess, and c10d20k datasets [26] are run on Hadoop with our methods and with those of Emin [21] and Li [15]. The experimental results show that, for the mushrooms dataset, our mining time is 48% faster than Emin's when the threshold is 0.15 and 63% faster than Li's when the threshold is 0.45. For the chess dataset, our mining time is 22% faster than Emin's and 97% faster than Li's when the threshold is 0.55. For the c10d20k dataset, our mining time is 50% faster than Emin's and 99% faster than Li's when the threshold is 0.15.
author2	WUU, LIH-CHYAU
author_facet	WUU, LIH-CHYAU WU, I-CHUN 吳亦鈞
author	WU, I-CHUN 吳亦鈞
spellingShingle	WU, I-CHUN 吳亦鈞 Data Mining Based on MapReduce Technology
author_sort	WU, I-CHUN
title	Data Mining Based on MapReduce Technology
publishDate	2018
url	http://ndltd.ncl.edu.tw/handle/ak4v53
work_keys_str_mv	AT wuichun dataminingbasedonmapreducetechnology AT wúyìjūn dataminingbasedonmapreducetechnology AT wuichun jīyúmapreduceduìyúzīliàotànkānjìshùzhīyánzhì AT wúyìjūn jīyúmapreduceduìyúzīliàotànkānjìshùzhīyánzhì
_version_	1719170532791812096
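
The abstract's binary encoding and its two frequent-itemset strategies can be illustrated with a small single-machine sketch. This is not the thesis's implementation: it does not use Hadoop or MapReduce, and the function names (encode, support, brute_force, candidate_itemsets), the toy transactions, and the 0.5 threshold are all hypothetical choices made for illustration. It only shows how a 1/0 encoding reduces support counting to a bitwise AND, and how Apriori-style pruning avoids testing every combination.

```python
from itertools import combinations

def encode(transactions, items):
    """Encode each transaction as a bitmask: bit i is 1 if items[i] is present."""
    index = {item: i for i, item in enumerate(items)}
    masks = []
    for transaction in transactions:
        mask = 0
        for item in transaction:
            mask |= 1 << index[item]
        masks.append(mask)
    return masks

def support(masks, itemset):
    """Fraction of transactions that contain every item of the itemset bitmask."""
    hits = sum(1 for m in masks if (m & itemset) == itemset)
    return hits / len(masks)

def brute_force(masks, n_items, min_support):
    """Exhaustive search: test every non-empty combination of items."""
    frequent = {}
    for candidate in range(1, 1 << n_items):
        s = support(masks, candidate)
        if s >= min_support:
            frequent[candidate] = s
    return frequent

def candidate_itemsets(masks, n_items, min_support):
    """Level-wise search that prunes candidates using the Apriori property."""
    frequent = {}
    level = [1 << i for i in range(n_items)]  # start with single items
    while level:
        current = {}
        for c in level:
            s = support(masks, c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        next_level = set()
        for a, b in combinations(current, 2):
            union = a | b
            # Join step: keep unions that are exactly one item larger.
            if bin(union).count("1") != bin(a).count("1") + 1:
                continue
            # Prune step: every subset with one item removed must be frequent.
            subsets = [union & ~(1 << i) for i in range(n_items) if (union >> i) & 1]
            if all(sub in current for sub in subsets):
                next_level.add(union)
        level = sorted(next_level)
    return frequent

# Hypothetical toy data, not one of the datasets named in the abstract.
items = ["a", "b", "c"]
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
masks = encode(transactions, items)
exhaustive = brute_force(masks, len(items), 0.5)
pruned = candidate_itemsets(masks, len(items), 0.5)
```

Both functions return the same frequent itemsets on this toy input; the difference is that the pruned search never counts the support of {a, b, c}, because its subset {b, c} is already known to be infrequent.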

Similar Items