Parallel Decision Tree Construction Using Attribute-Key Classification


Bibliographic Details
Main Authors: Shiau-Rung Tsui, 崔孝戎
Other Authors: Wei-Jen Wang
Format: Others
Language: zh-TW
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/98739019787232838356
id ndltd-TW-101NCU05392130
record_format oai_dc
spelling ndltd-TW-101NCU053921302015-10-13T22:34:51Z http://ndltd.ncl.edu.tw/handle/98739019787232838356 Parallel Decision Tree Construction Using Attribute-Key Classification 平行式關鍵屬性區別決策樹演算法 Shiau-Rung Tsui 崔孝戎 Master's === National Central University === Department of Computer Science and Information Engineering === 101 === In machine learning, a classifier associates newly observed instances with a set of categories, based on a training set of data containing instances whose category membership is known. Decision-tree-based algorithms are popular classifiers in many application domains because they are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits the training dataset into subsets enriched in one class or another, and then iteratively splits the subsets until the instances of each subset belong to one class. There are two popular approaches to splitting the training dataset into subsets: searching the support confidence of classes and using the information gain. A common weakness of these two splitting methods is that they may not perform well on a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute splits off a subset whose instances all share that key value; the intersection of every pair of key-value subsets is empty; the remaining instances that are hard to distinguish are put into one subset. The proposed algorithm automatically detects the best attribute for distinguishing different class sets, rather than individual classes, and uses that attribute to split the dataset level by level. The proposed algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree.
The experimental results show that the proposed algorithm is generally better than existing decision-tree-based classifiers such as C4.5 in classification accuracy. Moreover, it is generally better in classification accuracy than other types of classification algorithms, such as SVM, logistic regression, and naïve Bayes. To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF (earliest-deadline-first) scheduling algorithm to balance the workload among all computing hosts, in order to shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has a shorter execution time than the Hadoop implementation; however, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes. Wei-Jen Wang 王尉任 2013 thesis 55 zh-TW
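The key-value split the abstract describes (one disjoint subset per key value of the chosen attribute, plus a single remainder subset for instances that are hard to distinguish) can be sketched as follows. This is a minimal illustration only; the function name, the dict-of-lists layout, and the way key values are supplied are our assumptions, not the thesis's actual interface:

```python
from collections import defaultdict

def split_by_attribute_keys(instances, attribute, key_values):
    """Partition instances on the key values of one attribute.

    Each key value yields a disjoint subset whose instances all share
    that value; instances whose value is not a key value fall into a
    single remainder subset. Illustrative sketch only -- all names here
    are hypothetical, not taken from the thesis.
    """
    subsets = defaultdict(list)
    remainder = []
    keys = set(key_values)
    for inst in instances:
        if inst[attribute] in keys:
            subsets[inst[attribute]].append(inst)
        else:
            remainder.append(inst)
    return dict(subsets), remainder
```

By construction, every pair of key-value subsets is disjoint (an instance has exactly one value for the attribute), matching the property stated in the abstract.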
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Central University === Department of Computer Science and Information Engineering === 101 === In machine learning, a classifier associates newly observed instances with a set of categories, based on a training set of data containing instances whose category membership is known. Decision-tree-based algorithms are popular classifiers in many application domains because they are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits the training dataset into subsets enriched in one class or another, and then iteratively splits the subsets until the instances of each subset belong to one class. There are two popular approaches to splitting the training dataset into subsets: searching the support confidence of classes and using the information gain. A common weakness of these two splitting methods is that they may not perform well on a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute splits off a subset whose instances all share that key value; the intersection of every pair of key-value subsets is empty; the remaining instances that are hard to distinguish are put into one subset. The proposed algorithm automatically detects the best attribute for distinguishing different class sets, rather than individual classes, and uses that attribute to split the dataset level by level. The proposed algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree. The experimental results show that the proposed algorithm is generally better than existing decision-tree-based classifiers such as C4.5 in classification accuracy.
Moreover, it is generally better in classification accuracy than other types of classification algorithms, such as SVM, logistic regression, and naïve Bayes. To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF (earliest-deadline-first) scheduling algorithm to balance the workload among all computing hosts, in order to shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has a shorter execution time than the Hadoop implementation; however, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes.
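The abstract describes the MPI workload balancer only as a heuristic based on EDF scheduling. One common greedy variant of such a balancer (hand each task to the host whose queue finishes earliest, taking longer tasks first) can be sketched as below; the exact rule, the cost model, and all names are our assumptions, not the thesis's implementation:

```python
import heapq

def balance_workload(task_costs, num_hosts):
    """Greedy balancer: assign each task to the host that becomes free
    earliest, processing longer tasks first. A sketch of one plausible
    EDF-flavoured heuristic; details here are assumptions, not the
    thesis's actual balancer. Returns one list of task indices per host.
    """
    # Min-heap of (projected finish time, host id).
    heap = [(0.0, h) for h in range(num_hosts)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_hosts)]
    # Scheduling longer tasks first tends to even out host loads.
    for idx in sorted(range(len(task_costs)),
                      key=lambda i: -task_costs[i]):
        finish, host = heapq.heappop(heap)
        assignment[host].append(idx)
        heapq.heappush(heap, (finish + task_costs[idx], host))
    return assignment
```

The goal, as in the abstract, is to shorten the total execution time by keeping the slowest host's load (the makespan) small.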
author2 Wei-Jen Wang
author_facet Wei-Jen Wang
Shiau-Rung Tsui
崔孝戎
author Shiau-Rung Tsui
崔孝戎
spellingShingle Shiau-Rung Tsui
崔孝戎
Parallel Decision Tree Construction Using Attribute-Key Classification
author_sort Shiau-Rung Tsui
title Parallel Decision Tree Construction Using Attribute-Key Classification
title_short Parallel Decision Tree Construction Using Attribute-Key Classification
title_full Parallel Decision Tree Construction Using Attribute-Key Classification
title_fullStr Parallel Decision Tree Construction Using Attribute-Key Classification
title_full_unstemmed Parallel Decision Tree Construction Using Attribute-Key Classification
title_sort parallel decision tree construction using attribute-key classification
publishDate 2013
url http://ndltd.ncl.edu.tw/handle/98739019787232838356
work_keys_str_mv AT shiaurungtsui paralleldecisiontreeconstructionusingattributekeyclassification
AT cuīxiàoróng paralleldecisiontreeconstructionusingattributekeyclassification
AT shiaurungtsui píngxíngshìguānjiànshǔxìngqūbiéjuécèshùyǎnsuànfǎ
AT cuīxiàoróng píngxíngshìguānjiànshǔxìngqūbiéjuécèshùyǎnsuànfǎ
_version_ 1718078096010117120