A Parameter-Free Cleaning Method for SMOTE in Imbalanced Classification

Oversampling is an efficient technique for dealing with the class-imbalance problem. It addresses the problem by duplicating or generating minority-class samples to balance the distribution between the majority and minority classes. Synthetic minority oversampling technique (SMOTE)...

Bibliographic Details
Main Authors: Yuanting Yan, Ruiqing Liu, Zihan Ding, Xiuquan Du, Jie Chen, Yanping Zhang
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Imbalanced data; SMOTE; oversampling; constructive covering algorithm; data cleaning
Online Access: https://ieeexplore.ieee.org/document/8642396/
id doaj-1a55d3970ca44b89aa96f6b730fbdfff
record_format Article
spelling doaj-1a55d3970ca44b89aa96f6b730fbdfff, 2021-03-29T22:37:10Z, eng
IEEE, IEEE Access, ISSN 2169-3536, 2019-01-01, vol. 7, pp. 23537-23548
DOI 10.1109/ACCESS.2019.2899467, IEEE document 8642396
A Parameter-Free Cleaning Method for SMOTE in Imbalanced Classification
Yuanting Yan (https://orcid.org/0000-0001-6090-910X), Ruiqing Liu, Zihan Ding, Xiuquan Du, Jie Chen, Yanping Zhang
School of Computer Science and Technology, Anhui University, Hefei, China (all six authors)
https://ieeexplore.ieee.org/document/8642396/
Imbalanced data, SMOTE, oversampling, constructive covering algorithm, data cleaning
collection DOAJ
language English
format Article
sources DOAJ
author Yuanting Yan
Ruiqing Liu
Zihan Ding
Xiuquan Du
Jie Chen
Yanping Zhang
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Oversampling is an efficient technique for dealing with the class-imbalance problem. It addresses the problem by duplicating or generating minority-class samples to balance the distribution between the majority and minority classes. The synthetic minority oversampling technique (SMOTE) is one of the typical representatives, and researchers have proposed many variants of it over the past decade. However, existing oversampling methods may generate incorrect minority-class samples in some scenarios. Furthermore, effectively mining the inherent complex characteristics of imbalanced data remains a challenge. To this end, this paper proposes a parameter-free data cleaning method, based on the constructive covering algorithm, to improve SMOTE. The dataset generated by SMOTE is first partitioned into a group of covers; the hard-to-learn samples are then detected based on the characteristics of the sample-space distribution. Finally, a pair-wise deletion strategy is proposed to remove the hard-to-learn samples. Experimental results on 25 imbalanced datasets show that the proposed method is superior to the comparison methods in terms of various metrics, such as F-measure, G-mean, and Recall. The method not only reduces the complexity of the dataset but also improves the performance of the classification model.
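The description above refers to SMOTE generating new minority-class samples. As a rough illustration only, the sketch below shows the standard SMOTE interpolation step (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours). It is not the paper's cleaning method, which this record does not specify in detail; the function name `smote_sketch` and its parameters `n_new`, `k`, and `seed` are hypothetical names chosen for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Illustrative sketch of SMOTE-style interpolation; the name and
    parameters are assumptions made for this example.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate: base + lam * (neighbour - base), with lam in [0, 1).
        lam = rng.random()
        synthetic.append([b + lam * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
samples = smote_sketch(minority, n_new=6, k=2)
```

Because every synthetic point lies between two existing minority samples, some can fall inside majority-class regions, which is the kind of wrong sample the described cleaning step is meant to detect and remove.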
topic Imbalanced data
SMOTE
oversampling
constructive covering algorithm
data cleaning
url https://ieeexplore.ieee.org/document/8642396/