On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction
Breast cancer prediction datasets are usually class imbalanced, where the numbers of data samples in the malignant and benign patient classes are significantly different. Over-sampling techniques can be used to re-balance the datasets to construct more effective prediction models. Moreover, some rela...
Main Authors: | Min-Wei Huang, Chien-Hung Chiu, Chih-Fong Tsai, Wei-Chao Lin |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-07-01 |
Series: | Applied Sciences |
Subjects: | breast cancer; data mining; machine learning; feature selection; over-sampling; class imbalance |
Online Access: | https://www.mdpi.com/2076-3417/11/14/6574 |
id | doaj-f167f08adf074fd88171be9bb15a6a3a |
---|---|
record_format | Article |
spelling | Min-Wei Huang (Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung 406040, Taiwan); Chien-Hung Chiu (Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan); Chih-Fong Tsai (Department of Information Management, National Central University, Taoyuan 320317, Taiwan); Wei-Chao Lin (Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan). "On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction." Applied Sciences (MDPI AG), ISSN 2076-3417, vol. 11, no. 14, article 6574, 2021-07-01. DOI: 10.3390/app11146574. https://www.mdpi.com/2076-3417/11/14/6574 |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Min-Wei Huang; Chien-Hung Chiu; Chih-Fong Tsai; Wei-Chao Lin |
title | On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction |
publisher | MDPI AG |
series | Applied Sciences |
issn | 2076-3417 |
publishDate | 2021-07-01 |
description | Breast cancer prediction datasets are usually class imbalanced, where the numbers of data samples in the malignant and benign patient classes are significantly different. Over-sampling techniques can be used to re-balance the datasets to construct more effective prediction models. Moreover, some related studies have considered feature selection to remove irrelevant features from the datasets for further performance improvement. However, since the order of combining feature selection and over-sampling results in different training sets for constructing the prediction model, it is unknown which order performs better. In this paper, the information gain (IG) and genetic algorithm (GA) feature selection methods and the synthetic minority over-sampling technique (SMOTE) are used in different combinations. The experimental results based on two breast cancer datasets show that combining feature selection and over-sampling outperforms using either feature selection or over-sampling alone for highly class-imbalanced datasets. In particular, performing IG first and SMOTE second is the better choice. For other datasets with a small class imbalance ratio and a smaller number of features, performing SMOTE alone is enough to construct an effective prediction model. (An illustrative sketch of the IG-first, SMOTE-second ordering is given after the record fields below.) |
topic | breast cancer; data mining; machine learning; feature selection; over-sampling; class imbalance |
url | https://www.mdpi.com/2076-3417/11/14/6574 |
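The description above reports that, for the highly class-imbalanced data, applying information gain (IG) feature selection first and SMOTE over-sampling second produced the best models. The following is a minimal sketch of that ordering, not the authors' implementation: it assumes the scikit-learn and imbalanced-learn libraries, uses mutual_info_classif as an information-gain-style scorer, uses the library's built-in Wisconsin breast cancer data as a stand-in for the paper's datasets, and makes arbitrary illustrative choices of k = 10 selected features and an SVM classifier.

```python
# Minimal sketch (assumed setup, not the paper's code): IG-style feature
# selection first, SMOTE second, with over-sampling applied only to the
# training split. Requires scikit-learn and imbalanced-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 1: feature selection on the original (imbalanced) training set.
# mutual_info_classif stands in for information gain; k=10 is an
# illustrative choice, not a value taken from the paper.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_fs = selector.fit_transform(X_train, y_train)
X_test_fs = selector.transform(X_test)

# Step 2: over-sample the minority class of the reduced training set.
# The test set is never resampled.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_fs, y_train)

# Train and evaluate a classifier on the selected, re-balanced features.
clf = SVC(probability=True, random_state=42).fit(X_train_bal, y_train_bal)
auc = roc_auc_score(y_test, clf.predict_proba(X_test_fs)[:, 1])
print(f"AUC with IG-style selection first, SMOTE second: {auc:.3f}")
```

Swapping the two steps (calling SMOTE before SelectKBest) would let the synthetic minority samples influence the feature scores; comparing the two orderings, as well as using only one of the steps, is exactly the experimental question the abstract describes.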