Consensus Clustering-Based Undersampling Approach to Imbalanced Learning

Class imbalance is an important problem encountered in machine learning applications, in which one class (the minority class) has an extremely small number of instances while the other class (the majority class) has a vast number of instances. Imbalanced datasets arise in several real-world applications, including medical diagnosis, malware detection, anomaly identification, bankruptcy prediction, and spam filtering. In this paper, we present a consensus clustering-based undersampling approach to imbalanced learning, in which the majority class is undersampled using a consensus clustering scheme. In the empirical analysis, 44 small-scale and 2 large-scale imbalanced classification benchmarks were utilized. In the consensus clustering schemes, five clustering algorithms (k-means, k-modes, k-means++, self-organizing maps, and the DIANA algorithm) and their combinations were considered. In the classification phase, five supervised learning methods (naïve Bayes, logistic regression, support vector machines, random forests, and the k-nearest neighbor algorithm) and three ensemble methods (AdaBoost, bagging, and the random subspace algorithm) were utilized. The empirical results indicate that the proposed heterogeneous consensus clustering-based undersampling scheme yields better predictive performance.
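The general idea behind the abstract's scheme can be sketched in a few lines: cluster the majority class several times, combine the runs into a consensus partition, and keep one representative per consensus cluster. The sketch below is illustrative only, not the paper's exact method (the paper combines five different clusterers; here a minimal 1-D k-means is re-run with different seeds, the runs are merged via a co-association matrix, and each consensus cluster is reduced to its medoid). All function names and parameters are hypothetical.

```python
import random

def kmeans_1d(points, k, seed, iters=20):
    """Minimal 1-D k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center wins
        labels = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        # update step: recompute each center as its members' mean
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels

def consensus_undersample(majority, k=3, n_runs=5, threshold=0.5):
    """Cluster the majority class n_runs times, build a co-association
    matrix, merge points that co-cluster often enough, and keep only
    the medoid of each consensus cluster."""
    n = len(majority)
    runs = [kmeans_1d(majority, k, seed) for seed in range(n_runs)]
    # co[i][j] = fraction of runs in which points i and j share a cluster
    co = [[sum(r[i] == r[j] for r in runs) / n_runs for j in range(n)]
          for i in range(n)]
    # consensus clusters = connected components of the thresholded graph
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        start = unassigned.pop()
        comp, frontier = {start}, [start]
        while frontier:
            i = frontier.pop()
            for j in list(unassigned):
                if co[i][j] >= threshold:
                    unassigned.remove(j)
                    comp.add(j)
                    frontier.append(j)
        clusters.append(sorted(comp))
    # keep the medoid (point closest to the cluster mean) of each cluster
    kept = []
    for comp in clusters:
        mean = sum(majority[i] for i in comp) / len(comp)
        kept.append(min((majority[i] for i in comp), key=lambda p: abs(p - mean)))
    return kept

majority = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1, 9.0, 9.3]
minority = [3.0, 7.0]
reduced = consensus_undersample(majority)
print(len(reduced), len(majority))  # far fewer majority instances survive
```

The minority class is left untouched; only the majority class is reduced, which is what rebalances the training set before the classification phase.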


Bibliographic Details
Main Author: Aytuğ Onan
Format: Article
Language: English
Published: Hindawi Limited, 2019-01-01
Series: Scientific Programming
Online Access: http://dx.doi.org/10.1155/2019/5901087
ISSN: 1058-9244, 1875-919X