Examining characteristics of predictive models with imbalanced big data

Abstract High class imbalance between majority and minority classes in datasets can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias, for cases where the minority (positive) class is of greater interest and the occurrence of false negatives is costlier than false positives, may result in adverse consequences.

Full description

Bibliographic Details
Main Authors: Tawfiq Hasanin, Taghi M. Khoshgoftaar, Joffrey L. Leevy, Naeem Seliya
Format: Article
Language: English
Published: SpringerOpen 2019-07-01
Series: Journal of Big Data
Subjects:
Online Access: http://link.springer.com/article/10.1186/s40537-019-0231-2
id doaj-c9275c77d4c94a5c8395c6d89c8975af
record_format Article
spelling doaj-c9275c77d4c94a5c8395c6d89c8975af (2020-11-25T02:57:36Z)
Language: English | Publisher: SpringerOpen | Series: Journal of Big Data | ISSN: 2196-1115 | Published: 2019-07-01 | DOI: 10.1186/s40537-019-0231-2
Title: Examining characteristics of predictive models with imbalanced big data
Authors: Tawfiq Hasanin (Florida Atlantic University); Taghi M. Khoshgoftaar (Florida Atlantic University); Joffrey L. Leevy (Florida Atlantic University); Naeem Seliya (Ohio Northern University)
Abstract: High class imbalance between majority and minority classes in datasets can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias, for cases where the minority (positive) class is of greater interest and the occurrence of false negatives is costlier than false positives, may result in adverse consequences. Our paper presents two case studies, each utilizing a unique, combined approach of Random Undersampling and Feature Selection to investigate the effect of class imbalance on big data analytics. Random Undersampling is used to generate six class distributions ranging from balanced to moderately imbalanced, and Feature Importance is used as our Feature Selection method. Classification performance was reported for the Random Forest, Gradient-Boosted Trees, and Logistic Regression learners, as implemented within the Apache Spark framework. The first case study utilized a training dataset and a test dataset from the ECBDL’14 bioinformatics competition. The training and test datasets contain about 32 million instances and 2.9 million instances, respectively. For the first case study, Gradient-Boosted Trees obtained the best results, with either a features-set of 60 or the full set, and a negative-to-positive ratio of either 45:55 or 40:60. The second case study, unlike the first, included training data from one source (POST dataset) and test data from a separate source (Slowloris dataset), where POST and Slowloris are two types of Denial of Service attacks. The POST dataset contains about 1.7 million instances, while the Slowloris dataset contains about 0.2 million instances. For the second case study, Logistic Regression obtained the best results, with a features-set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. We conclude that combining Feature Selection with Random Undersampling improves the classification performance of learners with imbalanced big data from different application domains.
URL: http://link.springer.com/article/10.1186/s40537-019-0231-2
Topics: Big data; Feature Importance; Feature Selection; Class Imbalance; Machine Learning; Random Undersampling
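The Random Undersampling step described in the abstract, drawing majority-class (negative) instances down to a target negative-to-positive ratio such as 45:55, can be sketched in plain Python. This is a minimal illustration under assumed names and toy data, not the authors' Apache Spark implementation:

```python
import random

def random_undersample(data, labels, neg_pos_ratio, seed=42):
    """Downsample the majority (negative, label 0) class so that the
    resulting negative-to-positive ratio matches neg_pos_ratio,
    e.g. (45, 55) for a 45:55 class distribution."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(data, labels) if y == 1]
    neg = [(x, y) for x, y in zip(data, labels) if y == 0]
    neg_share, pos_share = neg_pos_ratio
    # Number of negatives needed to reach the target ratio, keeping all positives.
    n_neg = int(len(pos) * neg_share / pos_share)
    sampled_neg = rng.sample(neg, min(n_neg, len(neg)))
    combined = pos + sampled_neg
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)

# Toy example: 90 negatives, 10 positives, rebalanced to 50:50.
X = list(range(100))
y = [0] * 90 + [1] * 10
Xb, yb = random_undersample(X, y, (50, 50))
print(sum(yb), len(yb))  # 10 positives out of 20 instances
```

Keeping every positive instance and sampling only negatives mirrors the usual practice when, as here, false negatives are costlier than false positives.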
collection DOAJ
language English
format Article
sources DOAJ
author Tawfiq Hasanin
Taghi M. Khoshgoftaar
Joffrey L. Leevy
Naeem Seliya
spellingShingle Tawfiq Hasanin
Taghi M. Khoshgoftaar
Joffrey L. Leevy
Naeem Seliya
Examining characteristics of predictive models with imbalanced big data
Journal of Big Data
Big data
Feature Importance
Feature Selection
Class Imbalance
Machine Learning
Random Undersampling
author_facet Tawfiq Hasanin
Taghi M. Khoshgoftaar
Joffrey L. Leevy
Naeem Seliya
author_sort Tawfiq Hasanin
title Examining characteristics of predictive models with imbalanced big data
title_short Examining characteristics of predictive models with imbalanced big data
title_full Examining characteristics of predictive models with imbalanced big data
title_fullStr Examining characteristics of predictive models with imbalanced big data
title_full_unstemmed Examining characteristics of predictive models with imbalanced big data
title_sort examining characteristics of predictive models with imbalanced big data
publisher SpringerOpen
series Journal of Big Data
issn 2196-1115
publishDate 2019-07-01
description Abstract High class imbalance between majority and minority classes in datasets can skew the performance of Machine Learning algorithms and bias predictions in favor of the majority (negative) class. This bias, for cases where the minority (positive) class is of greater interest and the occurrence of false negatives is costlier than false positives, may result in adverse consequences. Our paper presents two case studies, each utilizing a unique, combined approach of Random Undersampling and Feature Selection to investigate the effect of class imbalance on big data analytics. Random Undersampling is used to generate six class distributions ranging from balanced to moderately imbalanced, and Feature Importance is used as our Feature Selection method. Classification performance was reported for the Random Forest, Gradient-Boosted Trees, and Logistic Regression learners, as implemented within the Apache Spark framework. The first case study utilized a training dataset and a test dataset from the ECBDL’14 bioinformatics competition. The training and test datasets contain about 32 million instances and 2.9 million instances, respectively. For the first case study, Gradient-Boosted Trees obtained the best results, with either a features-set of 60 or the full set, and a negative-to-positive ratio of either 45:55 or 40:60. The second case study, unlike the first, included training data from one source (POST dataset) and test data from a separate source (Slowloris dataset), where POST and Slowloris are two types of Denial of Service attacks. The POST dataset contains about 1.7 million instances, while the Slowloris dataset contains about 0.2 million instances. For the second case study, Logistic Regression obtained the best results, with a features-set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. We conclude that combining Feature Selection with Random Undersampling improves the classification performance of learners with imbalanced big data from different application domains.
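The Feature Selection step, ranking features by importance and keeping the top k (e.g. the features-sets of 5 or 60 reported above), can also be sketched in plain Python. Here a simple absolute-correlation score stands in for the tree-based Feature Importance the paper computes in Spark; the function names and toy data are assumptions:

```python
import math

def feature_scores(X, y):
    """Score each feature column by absolute Pearson correlation with the
    binary label -- a simple stand-in for tree-based Feature Importance."""
    n, d = len(X), len(X[0])
    mean_y = sum(y) / n
    var_y = sum((v - mean_y) ** 2 for v in y)
    scores = []
    for j in range(d):
        col = [row[j] for row in X]
        mean_x = sum(col) / n
        var_x = sum((v - mean_x) ** 2 for v in col)
        cov = sum((col[i] - mean_x) * (y[i] - mean_y) for i in range(n))
        denom = math.sqrt(var_x * var_y)
        scores.append(abs(cov / denom) if denom else 0.0)
    return scores

def select_top_k(X, y, k):
    """Keep only the k highest-scoring feature columns."""
    scores = feature_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in X], keep

# Toy example: feature 0 tracks the label perfectly; feature 1 is noise.
y = [0, 0, 1, 1, 0, 1]
X = [[label, i % 3] for i, label in enumerate(y)]
_, kept = select_top_k(X, y, 1)
print(kept)  # [0]
```

In the paper's actual pipeline, the importance scores would come from a fitted ensemble (e.g. Spark's Random Forest `featureImportances`) rather than a univariate statistic, but the ranking-and-truncation logic is the same.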
topic Big data
Feature Importance
Feature Selection
Class Imbalance
Machine Learning
Random Undersampling
url http://link.springer.com/article/10.1186/s40537-019-0231-2
work_keys_str_mv AT tawfiqhasanin examiningcharacteristicsofpredictivemodelswithimbalancedbigdata
AT taghimkhoshgoftaar examiningcharacteristicsofpredictivemodelswithimbalancedbigdata
AT joffreylleevy examiningcharacteristicsofpredictivemodelswithimbalancedbigdata
AT naeemseliya examiningcharacteristicsofpredictivemodelswithimbalancedbigdata
_version_ 1724710310918488064