Detecting web attacks using random undersampling and ensemble learners

Abstract Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5...

Bibliographic Details
Main Authors: Richard Zuech, John Hancock, Taghi M. Khoshgoftaar
Format: Article
Language: English
Published: SpringerOpen 2021-05-01
Series: Journal of Big Data
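Subjects: CSE-CIC-IDS2018, Intrusion Detection, Web Attacks, Class Imbalance, Random Undersampling, Ensemble Learners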
Online Access: https://doi.org/10.1186/s40537-021-00460-8
id doaj-5346b0eb73bc4f4ebef27ec624dbf98e
record_format Article
spelling doaj-5346b0eb73bc4f4ebef27ec624dbf98e; 2021-05-30T11:51:33Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2021-05-01; 81120; 10.1186/s40537-021-00460-8; Detecting web attacks using random undersampling and ensemble learners; Richard Zuech (Florida Atlantic University); John Hancock (Florida Atlantic University); Taghi M. Khoshgoftaar (Florida Atlantic University); https://doi.org/10.1186/s40537-021-00460-8; CSE-CIC-IDS2018; Intrusion Detection; Web Attacks; Class Imbalance; Random Undersampling; Ensemble Learners
collection DOAJ
language English
format Article
sources DOAJ
author Richard Zuech
John Hancock
Taghi M. Khoshgoftaar
title Detecting web attacks using random undersampling and ensemble learners
publisher SpringerOpen
series Journal of Big Data
issn 2196-1115
publishDate 2021-05-01
description Abstract Class imbalance is an important consideration for cybersecurity and machine learning. We explore classification performance in detecting web attacks in the recent CSE-CIC-IDS2018 dataset. This study considers a total of eight random undersampling (RUS) ratios: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. Additionally, seven different classifiers are employed: Decision Tree (DT), Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Naive Bayes (NB), and Logistic Regression (LR). For classification performance metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and Area Under the Precision-Recall Curve (AUPRC) are both used to answer the following three research questions. The first question asks: “Are various random undersampling ratios statistically different from each other in detecting web attacks?” The second question asks: “Are different classifiers statistically different from each other in detecting web attacks?” Our third question asks: “Is the interaction between different classifiers and random undersampling ratios significant for detecting web attacks?” Based on our experiments, the answer to all three research questions is “Yes”. To the best of our knowledge, we are the first to apply random undersampling techniques to web attacks from the CSE-CIC-IDS2018 dataset while exploring various sampling ratios.
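The random undersampling ratios listed in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_undersample`, the reading of each ratio as majority:minority (normal traffic to web attacks), the label encoding, and the synthetic labels are all assumptions made for illustration.

```python
import random
from collections import Counter

# The seven undersampled ratios named in the abstract (the eighth setting is "no sampling").
# Assumption: each pair is read as majority:minority, i.e. normal traffic to web attacks.
RUS_RATIOS = [(999, 1), (99, 1), (95, 5), (9, 1), (3, 1), (65, 35), (1, 1)]


def random_undersample(labels, majority_parts, minority_parts, majority_label=0, seed=0):
    """Return sorted row indices kept after randomly discarding majority-class rows.

    Every minority row is kept; majority rows are sampled without replacement
    until majority:minority is approximately majority_parts:minority_parts.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Never request more majority rows than exist.
    target = min(len(majority), round(len(minority) * majority_parts / minority_parts))
    return sorted(rng.sample(majority, target) + minority)


if __name__ == "__main__":
    # Synthetic labels for illustration only (hypothetical encoding: 0 = normal, 1 = attack).
    labels = [0] * 100_000 + [1] * 50
    for maj, mino in RUS_RATIOS:
        kept = random_undersample(labels, maj, mino)
        counts = Counter(labels[i] for i in kept)
        print(f"{maj}:{mino} -> normal={counts[0]}, attack={counts[1]}")
```

In a setup like this, the returned indices would select the training rows only; the test data would normally be left at its original class distribution so that AUC and AUPRC reflect the real imbalance.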
topic CSE-CIC-IDS2018
Intrusion Detection
Web Attacks
Class Imbalance
Random Undersampling
Ensemble Learners
url https://doi.org/10.1186/s40537-021-00460-8