Summary: | Master's === National Central University === Department of Computer Science and Information Engineering === 101 ===
In machine learning, a classifier assigns newly observed instances to a set of categories on the basis of a training set whose instances have known category membership. Decision-tree-based algorithms are popular classifiers in many application domains because the resulting models are simple to understand and interpret. To construct a decision tree, a typical decision-tree-based algorithm chooses the attribute of the training dataset that most effectively splits it into subsets enriched in one class or another, and then iteratively splits the subsets until the instances of each subset belong to a single class. Two popular approaches are used to split the training dataset into subsets: searching by the support and confidence of classes, and using the information gain. A common weakness of both splitting methods is that they may perform poorly on a dataset with several correlated attributes. To overcome this problem, we propose a new decision-tree algorithm based on the key values of an attribute. In our algorithm, each key value of an attribute is used to split off one subset whose instances all share that key value; the intersection of every pair of key-value subsets is empty, and the remaining instances that are hard to distinguish are placed in a single subset. The proposed algorithm automatically detects the best attribute for distinguishing different class sets, rather than individual classes, and uses that attribute to split the dataset level by level. The algorithm improves the correctness of data classification and reduces the space required to store the nodes of a decision tree. The experimental results show that the proposed algorithm generally achieves higher classification accuracy than existing decision-tree-based classifiers such as C4.5, and also higher accuracy than other types of classification algorithms such as SVM, logistic regression, and Naïve Bayes.
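To illustrate the contrast the abstract draws, the minimal Python sketch below shows both splitting styles: the classic entropy-based information gain used by C4.5-style algorithms, and a key-value partition in which each distinct key value of an attribute yields one subset (pairwise disjoint by construction), with the hard-to-distinguish remainder instances collected in a single leftover subset. The abstract does not specify how the thesis decides which values are key values, so the `is_key_value` predicate here is a hypothetical stand-in.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a multiset of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """C4.5-style criterion: entropy reduction from splitting on `attr`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def key_value_split(rows, labels, attr, is_key_value):
    """Partition on the key values of `attr`: one subset per key value,
    pairwise disjoint, plus one remainder subset for instances whose
    value of `attr` is not a key value (hard-to-distinguish instances).
    `is_key_value` is a placeholder for the thesis's detection rule."""
    subsets, remainder = {}, []
    for row, y in zip(rows, labels):
        v = row[attr]
        if is_key_value(attr, v):
            subsets.setdefault(v, []).append((row, y))
        else:
            remainder.append((row, y))
    return subsets, remainder
```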
To handle big data, we also propose two new parallel algorithms based on the proposed decision-tree algorithm: one is implemented in MPI and the other in Hadoop MapReduce. In the MPI implementation, we design a heuristic workload balancer based on the EDF (earliest-deadline-first) scheduling algorithm to balance the workload among all computing hosts and thereby shorten the total execution time. In the Hadoop implementation, we use an attribute-based parallelization strategy to shorten the total execution time. Both parallel implementations show good scalability in our experiments. The MPI implementation generally has a shorter execution time than the Hadoop implementation; however, the Hadoop implementation outperforms the MPI implementation on datasets with a relatively large number of attributes.
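The abstract names EDF as the basis of the MPI workload balancer but gives no task model, so the sketch below is an assumption-laden illustration rather than the thesis's method: each task (e.g., a subtree to grow) is assumed to carry a deadline and an estimated cost, tasks are served in earliest-deadline-first order, and each task is placed on the currently least-loaded host.

```python
import heapq

def edf_balance(tasks, num_hosts):
    """Heuristic balancer sketch: process tasks in earliest-deadline-first
    order, assigning each to the host with the smallest accumulated load.
    `tasks` is a list of (deadline, estimated_cost, task_id) tuples; the
    deadline and cost fields are assumptions not stated in the abstract."""
    hosts = [(0.0, h) for h in range(num_hosts)]  # min-heap of (load, host)
    heapq.heapify(hosts)
    assignment = {h: [] for h in range(num_hosts)}
    for deadline, cost, task_id in sorted(tasks):  # EDF ordering
        load, h = heapq.heappop(hosts)             # least-loaded host
        assignment[h].append(task_id)
        heapq.heappush(hosts, (load + cost, h))
    return assignment

# Example: three hypothetical subtree-construction tasks on two MPI ranks.
print(edf_balance([(5.0, 3.0, "t1"), (2.0, 4.0, "t2"), (9.0, 1.0, "t3")], 2))
```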
|