Selecting critical features for data classification based on machine learning methods

Abstract Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with a large number of variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, decreases the training time, reduces overfitting by enhancing generalization, and avoids the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken to be the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
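The pipeline the abstract describes (rank features with a Random Forest, keep the top-ranked subset, refit, and compare accuracy and kappa with and without selection) can be sketched as follows. The paper works in R with caret's varImp(), Boruta, and RFE on the Bank Marketing, Car Evaluation, and HAR data sets; the sketch below is a Python/scikit-learn approximation on synthetic data, so every data set, parameter, and variable name here is an illustrative assumption, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paper's data sets, which are not bundled here.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: Random Forest on all 30 features.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("all features:    acc=%.3f  kappa=%.3f"
      % (accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred)))

# Importance-based selection (the analogue of caret's varImp() ranking):
# keep the eight features the forest ranks highest, then refit.
top = np.argsort(rf.feature_importances_)[::-1][:8]
rf_top = RandomForestClassifier(n_estimators=200, random_state=0)
rf_top.fit(X_tr[:, top], y_tr)
pred_top = rf_top.predict(X_te[:, top])
print("top-8 by varImp: acc=%.3f  kappa=%.3f"
      % (accuracy_score(y_te, pred_top), cohen_kappa_score(y_te, pred_top)))

# Recursive Feature Elimination with the forest as the ranking estimator.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=8).fit(X_tr, y_tr)
pred_rfe = rfe.predict(X_te)
print("RFE top-8:       acc=%.3f  kappa=%.3f"
      % (accuracy_score(y_te, pred_rfe), cohen_kappa_score(y_te, pred_rfe)))
```

Reporting both accuracy and Cohen's kappa, as the paper does, guards against a classifier that looks accurate only because the classes are imbalanced.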


Bibliographic Details
Main Authors: Rung-Ching Chen, Christine Dewi, Su-Wen Huang, Rezzy Eko Caraka
Format: Article
Language:English
Published: SpringerOpen 2020-07-01
Series:Journal of Big Data
Subjects:
Random Forest
Features selection
SVM
Classification
KNN
LDA
Online Access:http://link.springer.com/article/10.1186/s40537-020-00327-4
id doaj-28fd8afa6ea045ebbe5dd6734f032555
record_format Article
spelling doaj-28fd8afa6ea045ebbe5dd6734f0325552020-11-25T03:54:04ZengSpringerOpenJournal of Big Data2196-11152020-07-017112610.1186/s40537-020-00327-4
Selecting critical features for data classification based on machine learning methods
Rung-Ching Chen, Christine Dewi, Su-Wen Huang, Rezzy Eko Caraka (all: Department of Information Management, Chaoyang University of Technology)
Abstract Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with a large number of variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, decreases the training time, reduces overfitting by enhancing generalization, and avoids the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken to be the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
http://link.springer.com/article/10.1186/s40537-020-00327-4
Random Forest; Features selection; SVM; Classification; KNN; LDA
collection DOAJ
language English
format Article
sources DOAJ
author Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
spellingShingle Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
Selecting critical features for data classification based on machine learning methods
Journal of Big Data
Random Forest
Features selection
SVM
Classification
KNN
LDA
author_facet Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
author_sort Rung-Ching Chen
title Selecting critical features for data classification based on machine learning methods
title_short Selecting critical features for data classification based on machine learning methods
title_full Selecting critical features for data classification based on machine learning methods
title_fullStr Selecting critical features for data classification based on machine learning methods
title_full_unstemmed Selecting critical features for data classification based on machine learning methods
title_sort selecting critical features for data classification based on machine learning methods
publisher SpringerOpen
series Journal of Big Data
issn 2196-1115
publishDate 2020-07-01
description Abstract Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with a large number of variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, decreases the training time, reduces overfitting by enhancing generalization, and avoids the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken to be the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
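Of the three selection methods the description names, Boruta is the least self-explanatory: it appends shuffled "shadow" copies of every feature and keeps only features whose importance beats the best shadow. The following is a deliberately minimal Python sketch of that idea on synthetic data, not the boruta_py package or the authors' R code; the sizes and thresholding are illustrative assumptions (real Boruta iterates this test with statistical correction).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=800, n_features=20, n_informative=5,
                           random_state=0)

# Shadow features: each column shuffled independently, destroying any
# association with the label while preserving marginal distributions.
shadows = rng.permuted(X, axis=0)

# Fit one forest on real + shadow features together.
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

# Keep only real features whose importance exceeds the best shadow's.
real_imp = rf.feature_importances_[:20]
shadow_max = rf.feature_importances_[20:].max()
kept = np.where(real_imp > shadow_max)[0]
print("features kept:", kept)
```

Because a shadow feature can only look important by chance, the best shadow's importance serves as a data-driven noise floor; features below it are candidates for removal.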
topic Random Forest
Features selection
SVM
Classification
KNN
LDA
url http://link.springer.com/article/10.1186/s40537-020-00327-4
work_keys_str_mv AT rungchingchen selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
AT christinedewi selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
AT suwenhuang selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
AT rezzyekocaraka selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
_version_ 1724475017545121792