Stable bagging feature selection on medical data

Abstract: In the medical field, identifying the genes relevant to a specific disease, such as colon cancer, is crucial to finding a cure and to understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small...

Bibliographic Details
Main Author:	Salem Alelyani
Format:	Article
Language:	English
Published:	SpringerOpen 2021-01-01
Series:	Journal of Big Data
Subjects:	Feature selection; Ensemble technique; Bagging; Dimensionality reduction; Medical data; Microarray
Online Access:	https://doi.org/10.1186/s40537-020-00385-8

id	doaj-d79c19c80c734f5dbaf23ad643a2b8dd
record_format	Article
spelling	doaj-d79c19c80c734f5dbaf23ad643a2b8dd 2021-01-10T13:01:53Z eng SpringerOpen Journal of Big Data 2196-1115 2021-01-01 8 1 1 18 10.1186/s40537-020-00385-8 Stable bagging feature selection on medical data Salem Alelyani (Department of Computer Science, Center for Artificial Intelligence, King Khalid University) https://doi.org/10.1186/s40537-020-00385-8 Feature selection; Ensemble technique; Bagging; Dimensionality reduction; Medical data; Microarray
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Salem Alelyani
title	Stable bagging feature selection on medical data
publisher	SpringerOpen
series	Journal of Big Data
issn	2196-1115
publishDate	2021-01-01
description	Abstract: In the medical field, identifying the genes relevant to a specific disease, such as colon cancer, is crucial to finding a cure and to understanding its causes and subsequent complications. Medical datasets usually have very high dimensionality and a considerably small sample size. For domain experts such as biologists, identifying these genes has therefore become a very challenging task. Feature selection is a technique that aims to select these genes (features, in machine learning terms) with respect to the disease. However, learning from a medical dataset to identify relevant features suffers from the curse of dimensionality. Because of the large number of features and the small sample size, the selection usually returns a different subset each time a new sample is introduced into the dataset. This selection instability is intrinsically related to data variance. We assume that reducing data variance improves selection stability. In this paper, we propose an ensemble approach based on the bagging technique to improve feature selection stability in medical datasets via data variance reduction. We conducted an experiment using four microarray datasets, each of which suffers from high dimensionality and a relatively small sample size. On each dataset, we applied five well-known feature selection algorithms to select varying numbers of features. The proposed technique shows a significant improvement in selection stability while at least maintaining classification accuracy. The stability improvement ranges from 20 to 50 percent in all cases, which implies that the likelihood of selecting the same features increased by 20 to 50 percent. This is accompanied by an increase in classification accuracy in most cases, which supports the reported stability results.
topic	Feature selection; Ensemble technique; Bagging; Dimensionality reduction; Medical data; Microarray
url	https://doi.org/10.1186/s40537-020-00385-8
work_keys_str_mv	AT salemalelyani stablebaggingfeatureselectiononmedicaldata
_version_	1724341920970309632
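
The idea summarized in the abstract (run a feature selector on bootstrap resamples, aggregate the results, and measure how consistent the selected subsets are) can be illustrated with a minimal Python sketch. This is not the article's implementation. The mean-difference filter, the vote-count aggregation, the stratified bootstrap, the Jaccard stability measure, and all function names are assumptions made for illustration, and the data in the demo is synthetic.

```python
import random
from itertools import combinations
from statistics import mean


def score_features(X, y):
    """Stand-in filter (assumption): |mean(class 1) - mean(class 0)| per feature."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    return [abs(mean(r[j] for r in pos) - mean(r[j] for r in neg))
            for j in range(len(X[0]))]


def select_top_k(X, y, k):
    """Return the indices of the k highest-scoring features as a set."""
    scores = score_features(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def bagged_select(X, y, k, n_bags, rng):
    """Select k features by voting over stratified bootstrap resamples."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    votes = [0] * len(X[0])
    for _ in range(n_bags):
        # Resample with replacement inside each class so no class is lost.
        idx = []
        for members in by_class.values():
            idx.extend(rng.choices(members, k=len(members)))
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        for j in select_top_k(Xb, yb, k):
            votes[j] += 1
    # Keep the k features selected most often (ties go to the lower index).
    ranked = sorted(range(len(votes)), key=lambda j: votes[j], reverse=True)
    return set(ranked[:k])


def jaccard_stability(subsets):
    """Average pairwise Jaccard similarity between selected feature subsets."""
    return mean(len(a & b) / len(a | b) for a, b in combinations(subsets, 2))


if __name__ == "__main__":
    rng = random.Random(0)
    n, d, k = 40, 200, 10
    y = [i % 2 for i in range(n)]
    # Synthetic stand-in for microarray data: only the first 5 features carry signal.
    X = [[rng.gauss(1.0 if (j < 5 and y[i] == 1) else 0.0, 1.0) for j in range(d)]
         for i in range(n)]
    single, bagged = [], []
    for _ in range(10):
        # Perturb the dataset by dropping a few samples, then select features.
        keep = rng.sample(range(n), n - 4)
        Xs = [X[i] for i in keep]
        ys = [y[i] for i in keep]
        single.append(select_top_k(Xs, ys, k))
        bagged.append(bagged_select(Xs, ys, k, n_bags=25, rng=rng))
    print("single-run stability:", jaccard_stability(single))
    print("bagged stability:", jaccard_stability(bagged))
```

The demo compares the two stability scores on perturbed copies of one synthetic dataset. It makes no claim about reproducing the improvements reported in the abstract, which were obtained with five established selection algorithms on four microarray datasets.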

Similar Items