Selecting critical features for data classification based on machine learning methods

Abstract Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with a large number of variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, decreases the training time, reduces overfitting by enhancing generalization, and avoids the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken to be the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
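The pipeline the abstract describes (rank features with a Random Forest, keep the top-ranked subset, refit, and compare accuracy and kappa with and without selection) can be sketched as follows. The paper works in R with caret's varImp(), Boruta, and RFE on the Bank Marketing, Car Evaluation, and HAR data sets; the sketch below is a Python/scikit-learn approximation on synthetic data, so every data set, parameter, and variable name here is an illustrative assumption, not the authors' code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paper's data sets, which are not bundled here.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: Random Forest on all 30 features.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("all features:    acc=%.3f  kappa=%.3f"
      % (accuracy_score(y_te, pred), cohen_kappa_score(y_te, pred)))

# Importance-based selection (the analogue of caret's varImp() ranking):
# keep the eight features the forest ranks highest, then refit.
top = np.argsort(rf.feature_importances_)[::-1][:8]
rf_top = RandomForestClassifier(n_estimators=200, random_state=0)
rf_top.fit(X_tr[:, top], y_tr)
pred_top = rf_top.predict(X_te[:, top])
print("top-8 by varImp: acc=%.3f  kappa=%.3f"
      % (accuracy_score(y_te, pred_top), cohen_kappa_score(y_te, pred_top)))

# Recursive Feature Elimination with the forest as the ranking estimator.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
          n_features_to_select=8).fit(X_tr, y_tr)
pred_rfe = rfe.predict(X_te)
print("RFE top-8:       acc=%.3f  kappa=%.3f"
      % (accuracy_score(y_te, pred_rfe), cohen_kappa_score(y_te, pred_rfe)))
```

Reporting both accuracy and Cohen's kappa, as the paper does, guards against a classifier that looks accurate only because the classes are imbalanced.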


Bibliographic Details
Main Authors: Rung-Ching Chen, Christine Dewi, Su-Wen Huang, Rezzy Eko Caraka
Format: Article
Language:English
Published: SpringerOpen 2020-07-01
Series:Journal of Big Data
Subjects:
Random Forest
Features selection
SVM
Classification
KNN
LDA
Online Access:http://link.springer.com/article/10.1186/s40537-020-00327-4
id doaj-28fd8afa6ea045ebbe5dd6734f032555
record_format Article
spelling doaj-28fd8afa6ea045ebbe5dd6734f0325552020-11-25T03:54:04ZengSpringerOpenJournal of Big Data2196-11152020-07-017112610.1186/s40537-020-00327-4
Selecting critical features for data classification based on machine learning methods
Rung-Ching Chen, Christine Dewi, Su-Wen Huang, Rezzy Eko Caraka (all: Department of Information Management, Chaoyang University of Technology)
Abstract Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with a large number of variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, decreases the training time, reduces overfitting by enhancing generalization, and avoids the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken to be the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
http://link.springer.com/article/10.1186/s40537-020-00327-4
Random Forest; Features selection; SVM; Classification; KNN; LDA
collection DOAJ
language English
format Article
sources DOAJ
author Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
spellingShingle Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
Selecting critical features for data classification based on machine learning methods
Journal of Big Data
Random Forest
Features selection
SVM
Classification
KNN
LDA
author_facet Rung-Ching Chen
Christine Dewi
Su-Wen Huang
Rezzy Eko Caraka
author_sort Rung-Ching Chen
title Selecting critical features for data classification based on machine learning methods
title_short Selecting critical features for data classification based on machine learning methods
title_full Selecting critical features for data classification based on machine learning methods
title_fullStr Selecting critical features for data classification based on machine learning methods
title_full_unstemmed Selecting critical features for data classification based on machine learning methods
title_sort selecting critical features for data classification based on machine learning methods
publisher SpringerOpen
series Journal of Big Data
issn 2196-1115
publishDate 2020-07-01
description Abstract Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves the accuracy as well as the performance of classification. Random Forest has emerged as a quite useful algorithm that can handle feature selection even with a large number of variables. In this paper, we use three popular data sets with a large number of variables (Bank Marketing, Car Evaluation Database, and Human Activity Recognition Using Smartphones) to conduct the experiments. There are four main reasons why feature selection is essential: it simplifies the model by reducing the number of parameters, decreases the training time, reduces overfitting by enhancing generalization, and avoids the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of classification models such as Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA); the model with the highest accuracy is taken to be the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments clearly show a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each data set with and without feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best percentage accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.
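Of the three selection methods the description names, Boruta is the least self-explanatory: it appends shuffled "shadow" copies of every feature and keeps only features whose importance beats the best shadow. The following is a deliberately minimal Python sketch of that idea on synthetic data, not the boruta_py package or the authors' R code; the sizes and thresholding are illustrative assumptions (real Boruta iterates this test with statistical correction).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=800, n_features=20, n_informative=5,
                           random_state=0)

# Shadow features: each column shuffled independently, destroying any
# association with the label while preserving marginal distributions.
shadows = rng.permuted(X, axis=0)

# Fit one forest on real + shadow features together.
rf = RandomForestClassifier(n_estimators=300, random_state=0)
rf.fit(np.hstack([X, shadows]), y)

# Keep only real features whose importance exceeds the best shadow's.
real_imp = rf.feature_importances_[:20]
shadow_max = rf.feature_importances_[20:].max()
kept = np.where(real_imp > shadow_max)[0]
print("features kept:", kept)
```

Because a shadow feature can only look important by chance, the best shadow's importance serves as a data-driven noise floor; features below it are candidates for removal.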
topic Random Forest
Features selection
SVM
Classification
KNN
LDA
url http://link.springer.com/article/10.1186/s40537-020-00327-4
work_keys_str_mv AT rungchingchen selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
AT christinedewi selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
AT suwenhuang selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
AT rezzyekocaraka selectingcriticalfeaturesfordataclassificationbasedonmachinelearningmethods
_version_ 1724475017545121792