Consensus Clustering-Based Undersampling Approach to Imbalanced Learning
Class imbalance is an important problem encountered in machine learning applications, in which one class (the minority class) has an extremely small number of instances while the other class (the majority class) has a very large number of instances. Imbalanced datasets arise in several real-world applications, including medical diagnosis, malware detection, anomaly identification, bankruptcy prediction, and spam filtering. In this paper, we present a consensus clustering-based undersampling approach to imbalanced learning, in which the majority class is undersampled by means of a consensus clustering scheme. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks were used. In the consensus clustering schemes, five clustering algorithms (k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) and their combinations were considered. In the classification phase, five supervised learning methods (naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble methods (AdaBoost, bagging, and the random subspace algorithm) were employed. The empirical results indicate that the proposed heterogeneous consensus clustering-based undersampling scheme yields better predictive performance.
Main Author: | Aytuğ Onan (İzmir Katip Çelebi University, Faculty of Engineering and Architecture, Department of Computer Engineering, 35620 İzmir, Turkey)
---|---
Format: | Article
Language: | English
Published: | Hindawi Limited, 2019-01-01
Series: | Scientific Programming
ISSN: | 1058-9244, 1875-919X
Online Access: | http://dx.doi.org/10.1155/2019/5901087
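The sketch below illustrates the general idea of consensus clustering-based undersampling described in the abstract; it is not the procedure reported in the paper. Several k-means runs on the majority class are combined into a co-association matrix, a consensus partition is derived from it, and majority instances are then drawn evenly from the consensus clusters until the classes are balanced. Function and parameter names are illustrative choices, and a recent scikit-learn (>= 1.2) is assumed.

```python
# Illustrative sketch of consensus clustering-based undersampling of the
# majority class. It approximates the idea of combining several base
# clusterings (here: k-means runs with different seeds) into a consensus
# partition and keeping a balanced, cluster-spread subset of majority instances.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering


def consensus_undersample(X, y, majority_label, n_clusters=10, n_runs=5, random_state=0):
    """Return indices of a class-balanced subset of (X, y)."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    X_maj = X[maj_idx]

    # 1) Co-association matrix: fraction of runs in which two majority
    #    instances land in the same k-means cluster.
    n = len(maj_idx)
    coassoc = np.zeros((n, n))
    for run in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=random_state + run).fit_predict(X_maj)
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= n_runs

    # 2) Consensus partition: hierarchical clustering on the co-association
    #    distance (requires scikit-learn >= 1.2 for the `metric` argument).
    consensus = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    ).fit_predict(1.0 - coassoc)

    # 3) Draw majority instances evenly across consensus clusters until the
    #    majority count matches the minority count.
    target = len(min_idx)
    per_cluster = max(1, target // n_clusters)
    keep = []
    for c in np.unique(consensus):
        members = maj_idx[consensus == c]
        take = min(per_cluster, len(members))
        keep.extend(rng.choice(members, size=take, replace=False))
    keep = np.array(keep[:target])

    balanced_idx = np.concatenate([keep, min_idx])
    rng.shuffle(balanced_idx)
    return balanced_idx
```

The balanced index set returned by `consensus_undersample` can then be used to train any of the classifiers mentioned in the abstract, e.g. `clf.fit(X[idx], y[idx])` with a scikit-learn estimator.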