On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction
Breast cancer prediction datasets are usually class imbalanced, where the numbers of data samples in the malignant and benign patient classes are significantly different. Over-sampling techniques can be used to re-balance the datasets to construct more effective prediction models. Moreover, some rela...
Main Authors: | Min-Wei Huang, Chien-Hung Chiu, Chih-Fong Tsai, Wei-Chao Lin |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-07-01 |
Series: | Applied Sciences |
Subjects: | breast cancer; data mining; machine learning; feature selection; over-sampling; class imbalance |
Online Access: | https://www.mdpi.com/2076-3417/11/14/6574 |
id | doaj-f167f08adf074fd88171be9bb15a6a3a |
---|---|
record_format | Article |
spelling | Min-Wei Huang (Department of Physical Therapy and Graduate Institute of Rehabilitation Science, China Medical University, Taichung 406040, Taiwan); Chien-Hung Chiu (Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan); Chih-Fong Tsai (Department of Information Management, National Central University, Taoyuan 320317, Taiwan); Wei-Chao Lin (Department of Thoracic Surgery, Chang Gung Memorial Hospital, Linkou 333423, Taiwan). "On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction." Applied Sciences (MDPI AG), ISSN 2076-3417, vol. 11, no. 14, article 6574, 2021-07-01. DOI: 10.3390/app11146574. https://www.mdpi.com/2076-3417/11/14/6574 |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Min-Wei Huang; Chien-Hung Chiu; Chih-Fong Tsai; Wei-Chao Lin |
title | On Combining Feature Selection and Over-Sampling Techniques for Breast Cancer Prediction |
publisher | MDPI AG |
series | Applied Sciences |
issn | 2076-3417 |
publishDate | 2021-07-01 |
description | Breast cancer prediction datasets are usually class imbalanced, where the numbers of data samples in the malignant and benign patient classes are significantly different. Over-sampling techniques can be used to re-balance the datasets to construct more effective prediction models. Moreover, some related studies have considered feature selection to remove irrelevant features from the datasets for further performance improvement. However, since the order of combining feature selection and over-sampling results in different training sets for constructing the prediction model, it is unknown which order performs better. In this paper, the information gain (IG) and genetic algorithm (GA) feature selection methods and the synthetic minority over-sampling technique (SMOTE) are used in different combinations. The experimental results based on two breast cancer datasets show that combining feature selection and over-sampling outperforms using either feature selection or over-sampling alone for highly class-imbalanced datasets. In particular, performing IG first and SMOTE second is the better choice. For other datasets with a small class imbalance ratio and a smaller number of features, performing SMOTE alone is enough to construct an effective prediction model. (An illustrative sketch of the IG-first, SMOTE-second ordering is given after the record fields below.) |
topic | breast cancer; data mining; machine learning; feature selection; over-sampling; class imbalance |
url | https://www.mdpi.com/2076-3417/11/14/6574 |
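The description above reports that, for the highly class-imbalanced data, applying information gain (IG) feature selection first and SMOTE over-sampling second produced the best models. The following is a minimal sketch of that ordering, not the authors' implementation: it assumes the scikit-learn and imbalanced-learn libraries, uses mutual_info_classif as an information-gain-style scorer, uses the library's built-in Wisconsin breast cancer data as a stand-in for the paper's datasets, and makes arbitrary illustrative choices of k = 10 selected features and an SVM classifier.

```python
# Minimal sketch (assumed setup, not the paper's code): IG-style feature
# selection first, SMOTE second, with over-sampling applied only to the
# training split. Requires scikit-learn and imbalanced-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Step 1: feature selection on the original (imbalanced) training set.
# mutual_info_classif stands in for information gain; k=10 is an
# illustrative choice, not a value taken from the paper.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_fs = selector.fit_transform(X_train, y_train)
X_test_fs = selector.transform(X_test)

# Step 2: over-sample the minority class of the reduced training set.
# The test set is never resampled.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train_fs, y_train)

# Train and evaluate a classifier on the selected, re-balanced features.
clf = SVC(probability=True, random_state=42).fit(X_train_bal, y_train_bal)
auc = roc_auc_score(y_test, clf.predict_proba(X_test_fs)[:, 1])
print(f"AUC with IG-style selection first, SMOTE second: {auc:.3f}")
```

Swapping the two steps (calling SMOTE before SelectKBest) would let the synthetic minority samples influence the feature scores; comparing the two orderings, as well as using only one of the steps, is exactly the experimental question the abstract describes.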