On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction

Breast cancer prediction datasets are usually class imbalanced: the numbers of samples in the malignant and benign patient classes differ significantly. Over-sampling techniques can be used to re-balance such datasets and construct more effective prediction models. Moreover, some related studies have applied feature selection to remove irrelevant features for further performance improvement. However, because the order in which feature selection and over-sampling are applied produces different training sets for model construction, it is unclear which order performs better. In this paper, the information gain (IG) and genetic algorithm (GA) feature selection methods and the synthetic minority over-sampling technique (SMOTE) are evaluated in different combinations. Experimental results on two breast cancer datasets show that combining feature selection and over-sampling outperforms using either technique alone on highly class-imbalanced datasets; in particular, performing IG first and SMOTE second is the better choice. For datasets with a small class imbalance ratio and fewer features, applying SMOTE alone is enough to construct an effective prediction model.

Bibliographic Details
Main Authors: Min-Wei Huang, Chien-Hung Chiu, Chih-Fong Tsai, Wei-Chao Lin
Format: Article
Language: English
Published: MDPI AG, 2021-07-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app11146574
Subjects: breast cancer; data mining; machine learning; feature selection; over-sampling; class imbalance
Online Access: https://www.mdpi.com/2076-3417/11/14/6574

Author Affiliations:
Min-Wei Huang: Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung 406040, Taiwan
Chien-Hung Chiu: Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan
Chih-Fong Tsai: Department of Information Management, National Central University, Taoyuan 320317, Taiwan
Wei-Chao Lin: Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan
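
Illustration of the ordering discussed in the abstract (feature selection first, SMOTE second). This is a minimal sketch, not the authors' implementation: the dataset (scikit-learn's built-in Wisconsin breast cancer data), the use of mutual information as a stand-in for information gain, the number of selected features (k = 10), and the random forest classifier are all assumptions made here for illustration; SMOTE comes from the third-party imbalanced-learn package.

# Sketch of the "feature selection first, over-sampling second" pipeline.
# Assumptions (not from the paper): stand-in dataset, mutual information
# as an IG proxy, k = 10 features, random forest classifier.
# Requires scikit-learn and imbalanced-learn (pip install imbalanced-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

# Stand-in dataset (Wisconsin breast cancer; ~37% malignant, so only
# mildly imbalanced -- used here purely for illustration).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Step 1: feature selection on the training data only
# (mutual information used as a proxy for information gain).
selector = SelectKBest(mutual_info_classif, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Step 2: SMOTE over-sampling of the minority class in the reduced training data.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_sel, y_train)

# Step 3: train on the balanced, reduced training set; evaluate on the untouched test set.
clf = RandomForestClassifier(random_state=42).fit(X_train_bal, y_train_bal)
print("Test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test_sel)[:, 1]))

Swapping steps 1 and 2 (over-sampling before feature selection) gives the alternative ordering compared in the paper; per the abstract, selecting features first and applying SMOTE second is the better choice for highly class-imbalanced data.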