Parallel Decision Tree Construction Using Attribute-Key Classification


Bibliographic Details
Main Authors: Shiau-Rung Tsui, 崔孝戎
Other Authors: Wei-Jen Wang
Format: Others
Language: zh-TW
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/98739019787232838356
id ndltd-TW-101NCU05392130
record_format oai_dc
spelling ndltd-TW-101NCU053921302015-10-13T22:34:51Z http://ndltd.ncl.edu.tw/handle/98739019787232838356 Parallel Decision Tree Construction Using Attribute-Key Classification 平行式關鍵屬性區別決策樹演算法 Shiau-Rung Tsui 崔孝戎 Master's === National Central University === Department of Computer Science and Information Engineering === 101 === In machine learning, a classifier associates newly observed instances with a set of categories, based on a training set of data containing instances whose category membership is known. Decision-tree-based algorithms are popular classifiers in many application domains because they are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits the training dataset into subsets enriched in one class or another, and then iteratively splits the subsets until the instances of each subset belong to one class. There are two popular approaches to splitting the training dataset into subsets: searching the support confidence of classes and using the information gain. A common weakness of these two splitting methods is that they may not perform well on a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute splits off a subset whose instances all share that key value; the intersection of every pair of key-value subsets is empty; the remaining instances that are hard to distinguish are put into one subset. The proposed algorithm automatically detects the best attribute for distinguishing different class sets, rather than individual classes, and uses that attribute to split the dataset level by level. The proposed algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree.
The experimental results show that the proposed algorithm is generally better than existing decision-tree-based classifiers such as C4.5 in classification accuracy. Moreover, it is generally better in classification accuracy than other types of classification algorithms, such as SVM, logistic regression, and naïve Bayes. To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF (earliest-deadline-first) scheduling algorithm to balance the workload among all computing hosts, in order to shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has a shorter execution time than the Hadoop implementation; however, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes. Wei-Jen Wang 王尉任 2013 thesis 55 zh-TW
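The key-value split the abstract describes (one disjoint subset per key value of the chosen attribute, plus a single remainder subset for instances that are hard to distinguish) can be sketched as follows. This is a minimal illustration only; the function name, the dict-of-lists layout, and the way key values are supplied are our assumptions, not the thesis's actual interface:

```python
from collections import defaultdict

def split_by_attribute_keys(instances, attribute, key_values):
    """Partition instances on the key values of one attribute.

    Each key value yields a disjoint subset whose instances all share
    that value; instances whose value is not a key value fall into a
    single remainder subset. Illustrative sketch only -- all names here
    are hypothetical, not taken from the thesis.
    """
    subsets = defaultdict(list)
    remainder = []
    keys = set(key_values)
    for inst in instances:
        if inst[attribute] in keys:
            subsets[inst[attribute]].append(inst)
        else:
            remainder.append(inst)
    return dict(subsets), remainder
```

By construction, every pair of key-value subsets is disjoint (an instance has exactly one value for the attribute), matching the property stated in the abstract.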
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Central University === Department of Computer Science and Information Engineering === 101 === In machine learning, a classifier associates newly observed instances with a set of categories, based on a training set of data containing instances whose category membership is known. Decision-tree-based algorithms are popular classifiers in many application domains because they are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits the training dataset into subsets enriched in one class or another, and then iteratively splits the subsets until the instances of each subset belong to one class. There are two popular approaches to splitting the training dataset into subsets: searching the support confidence of classes and using the information gain. A common weakness of these two splitting methods is that they may not perform well on a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute splits off a subset whose instances all share that key value; the intersection of every pair of key-value subsets is empty; the remaining instances that are hard to distinguish are put into one subset. The proposed algorithm automatically detects the best attribute for distinguishing different class sets, rather than individual classes, and uses that attribute to split the dataset level by level. The proposed algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree. The experimental results show that the proposed algorithm is generally better than existing decision-tree-based classifiers such as C4.5 in classification accuracy.
Moreover, it is generally better in classification accuracy than other types of classification algorithms, such as SVM, logistic regression, and naïve Bayes. To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF (earliest-deadline-first) scheduling algorithm to balance the workload among all computing hosts, in order to shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has a shorter execution time than the Hadoop implementation; however, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes.
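The abstract describes the MPI workload balancer only as a heuristic based on EDF scheduling. One common greedy variant of such a balancer (hand each task to the host whose queue finishes earliest, taking longer tasks first) can be sketched as below; the exact rule, the cost model, and all names are our assumptions, not the thesis's implementation:

```python
import heapq

def balance_workload(task_costs, num_hosts):
    """Greedy balancer: assign each task to the host that becomes free
    earliest, processing longer tasks first. A sketch of one plausible
    EDF-flavoured heuristic; details here are assumptions, not the
    thesis's actual balancer. Returns one list of task indices per host.
    """
    # Min-heap of (projected finish time, host id).
    heap = [(0.0, h) for h in range(num_hosts)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_hosts)]
    # Scheduling longer tasks first tends to even out host loads.
    for idx in sorted(range(len(task_costs)),
                      key=lambda i: -task_costs[i]):
        finish, host = heapq.heappop(heap)
        assignment[host].append(idx)
        heapq.heappush(heap, (finish + task_costs[idx], host))
    return assignment
```

The goal, as in the abstract, is to shorten the total execution time by keeping the slowest host's load (the makespan) small.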
author2 Wei-Jen Wang
author_facet Wei-Jen Wang
Shiau-Rung Tsui
崔孝戎
author Shiau-Rung Tsui
崔孝戎
spellingShingle Shiau-Rung Tsui
崔孝戎
Parallel Decision Tree Construction Using Attribute-Key Classification
author_sort Shiau-Rung Tsui
title Parallel Decision Tree Construction Using Attribute-Key Classification
title_short Parallel Decision Tree Construction Using Attribute-Key Classification
title_full Parallel Decision Tree Construction Using Attribute-Key Classification
title_fullStr Parallel Decision Tree Construction Using Attribute-Key Classification
title_full_unstemmed Parallel Decision Tree Construction Using Attribute-Key Classification
title_sort parallel decision tree construction using attribute-key classification
publishDate 2013
url http://ndltd.ncl.edu.tw/handle/98739019787232838356
work_keys_str_mv AT shiaurungtsui paralleldecisiontreeconstructionusingattributekeyclassification
AT cuīxiàoróng paralleldecisiontreeconstructionusingattributekeyclassification
AT shiaurungtsui píngxíngshìguānjiànshǔxìngqūbiéjuécèshùyǎnsuànfǎ
AT cuīxiàoróng píngxíngshìguānjiànshǔxìngqūbiéjuécèshùyǎnsuànfǎ
_version_ 1718078096010117120