Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering

Email is the most common and comparatively the most eff...


Bibliographic Details
Main Authors: Rasib Khan, Mainul Mizan, Ragib Hasan, Alan Sprague
Format: Article
Language: English
Published: Association of Digital Forensics, Security and Law 2014-03-01
Series: Journal of Digital Forensics, Security and Law
Online Access: http://ojs.jdfsl.org/index.php/jdfsl/article/view/251
id doaj-101b1acf869a46818ffd00e404cd8b9a
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Rasib Khan
Mainul Mizan
Ragib Hasan
Alan Sprague
spellingShingle Rasib Khan
Mainul Mizan
Ragib Hasan
Alan Sprague
Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
Journal of Digital Forensics, Security and Law
author_facet Rasib Khan
Mainul Mizan
Ragib Hasan
Alan Sprague
author_sort Rasib Khan
title Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
title_short Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
title_full Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
title_fullStr Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
title_full_unstemmed Hot Zone Identification: Analyzing Effects of Data Sampling on Spam Clustering
title_sort hot zone identification: analyzing effects of data sampling on spam clustering
publisher Association of Digital Forensics, Security and Law
series Journal of Digital Forensics, Security and Law
issn 1558-7215
1558-7223
publishDate 2014-03-01
description Email is the most common and comparatively the most efficient means of exchanging information in today's world. However, given the widespread use of emails in all sectors, they have been the target of spammers since the beginning. Filtering spam emails has now led to critical actions such as forensic activities based on mining spam email. The data mine for spam emails at the University of Alabama at Birmingham is considered to be one of the most prominent resources for mining and identifying spam sources. It is a widely researched repository used by researchers from different global organizations. The usual process of mining the spam data involves going through every email in the data mine and clustering them based on their different attributes. However, given the size of the data mine, it takes an exceptionally long time to execute the clustering mechanism each time. In this paper, we have illustrated sampling as an efficient tool for data reduction, while preserving the information within the clusters, which would thus allow the spam forensic experts to quickly and effectively identify the ‘hot zone’ from the spam campaigns. We have provided detailed comparative analysis of the quality of the clusters after sampling, the overall distribution of clusters on the spam data, and timing measurements for our sampling approach. Additionally, we present different strategies which allowed us to optimize the sampling process using data-preprocessing and using the database engine's computational resources, and thus improving the performance of the clustering process.
url http://ojs.jdfsl.org/index.php/jdfsl/article/view/251
work_keys_str_mv AT rasibkhan hotzoneidentificationanalyzingeffectsofdatasamplingonspamclustering
AT mainulmizan hotzoneidentificationanalyzingeffectsofdatasamplingonspamclustering
AT ragibhasan hotzoneidentificationanalyzingeffectsofdatasamplingonspamclustering
AT alansprague hotzoneidentificationanalyzingeffectsofdatasamplingonspamclustering
_version_ 1724857608979873792