Semi-Supervised Self-Training Feature Weighted Clustering Decision Tree and Random Forest

A self-training algorithm is an iterative method for semi-supervised learning that wraps around a base learner and uses the learner's own predictions to assign labels to unlabeled data. For a self-training algorithm, the classification ability of the base learner and the estimation of prediction confidence are both critical. A classical decision tree is not an effective base learner for self-training because it cannot correctly estimate the confidence of its own predictions. In this paper, we propose a novel node-split method for decision trees that uses weighted features to cluster instances; the method can combine multiple numerical and categorical features to split a node. The decision tree and random forest constructed with this method are called FWCDT and FWCRF, respectively. When training instances are few, FWCDT and FWCRF classify better than classical decision trees and forests based on univariate splits, which makes them more suitable as base classifiers in self-training. Moreover, building on the proposed node-split method, we explore suitable prediction-confidence measures for FWCDT and FWCRF, respectively. Finally, experiments on UCI datasets show that the self-training feature weighted clustering decision tree (ST-FWCDT) and random forest (ST-FWCRF) can effectively exploit unlabeled data, and the resulting classifiers have better generalization ability.
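
The first sentences of the abstract describe the standard self-training wrapper. Below is a minimal sketch of that loop in Python, with scikit-learn's DecisionTreeClassifier standing in for the paper's FWCDT (whose implementation is not reproduced here); the 0.9 confidence threshold and the iteration cap are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def self_train(X_labeled, y_labeled, X_unlabeled,
               base_learner=None, threshold=0.9, max_iter=10):
    """Generic self-training loop: fit on labeled data, pseudo-label
    high-confidence unlabeled instances, and refit until no instance
    clears the confidence threshold."""
    model = base_learner or DecisionTreeClassifier()
    X_lab, y_lab = np.asarray(X_labeled), np.asarray(y_labeled)
    X_unl = np.asarray(X_unlabeled)

    for _ in range(max_iter):
        if len(X_unl) == 0:
            break
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_unl)        # prediction confidence
        conf = proba.max(axis=1)
        picked = conf >= threshold                # keep confident predictions only
        if not picked.any():
            break                                 # nothing left to pseudo-label
        pseudo = model.classes_[proba[picked].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unl[picked]]) # absorb pseudo-labeled instances
        y_lab = np.concatenate([y_lab, pseudo])
        X_unl = X_unl[~picked]

    model.fit(X_lab, y_lab)                       # final refit on all absorbed data
    return model
```

The paper's criticism of the classical tree applies directly to the `predict_proba` call above: a plain decision tree reports the class frequencies of a leaf as probabilities, which are poorly calibrated estimates of its own accuracy, so wrongly pseudo-labeled instances can clear the threshold. This is what motivates FWCDT/FWCRF and their dedicated confidence measures.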
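The abstract also states that the node split weights the features and clusters the instances, combining numerical and categorical features in a single multivariate split. The paper's exact weighting and clustering scheme is not given here; the sketch below substitutes plausible stand-ins (mutual-information weights, one-hot encoding for categorical features, 2-means clustering) purely to illustrate the shape of such a split.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import mutual_info_classif

def weighted_cluster_split(X_num, X_cat, y, random_state=0):
    """Assign the instances at a node (2-D arrays X_num, X_cat) to two
    children by clustering feature-weighted data. The weights (mutual
    information with the label) and the clustering algorithm (2-means)
    are illustrative assumptions, not the paper's actual scheme."""
    # Encode categorical features numerically so both kinds can be combined.
    blocks = []
    if X_num.shape[1] > 0:
        blocks.append(StandardScaler().fit_transform(X_num))
    if X_cat.shape[1] > 0:
        enc = OneHotEncoder(handle_unknown="ignore")
        blocks.append(enc.fit_transform(X_cat).toarray())
    Z = np.hstack(blocks)

    # Weight each encoded column by its relevance to the class label.
    w = mutual_info_classif(Z, y, random_state=random_state)
    w = w / (w.sum() + 1e-12)

    # Cluster the weighted instances into two child nodes.
    km = KMeans(n_clusters=2, n_init=10, random_state=random_state)
    return km.fit_predict(Z * w)   # 0 -> left child, 1 -> right child
```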

Bibliographic Details
Main Authors: Zhenyu Liu (ORCID: 0000-0002-9705-8377), Tao Wen, Wei Sun, Qilong Zhang
Affiliations: Zhenyu Liu, Tao Wen, and Qilong Zhang: College of Computer Science and Engineering, Northeastern University, Shenyang, China; Wei Sun: Department of Computer Science and Technology, Dalian Neusoft University of Information, Dalian, China
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access, vol. 8, pp. 128337-128348
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3008951
Subjects: Semi-supervised learning; self-training; decision tree; random forest; node splits
Online Access: https://ieeexplore.ieee.org/document/9139499/