Categorizing Blog Spam

The internet has matured into the focal point of our era. Its ecosystem is vast, complex, and in many regards unaccounted for. One of the most prevalent aspects of the internet is spam. Similar to the rest of the internet, spam has evolved from simply meaning ‘unwanted emails’ to a blanket term that...


Bibliographic Details
Main Author: Bevans, Brandon
Format: Others
Published: DigitalCommons@CalPoly 2016
Subjects:
Online Access: https://digitalcommons.calpoly.edu/theses/1623
https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=2820&context=theses
id ndltd-CALPOLY-oai-digitalcommons.calpoly.edu-theses-2820
record_format oai_dc
spelling ndltd-CALPOLY-oai-digitalcommons.calpoly.edu-theses-2820 2021-08-20T05:02:19Z Categorizing Blog Spam Bevans, Brandon 2016-06-01T07:00:00Z text application/pdf https://digitalcommons.calpoly.edu/theses/1623 https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=2820&context=theses Master's Theses DigitalCommons@CalPoly machine learning blog spam natural language processing corpus botnet Artificial Intelligence and Robotics Information Security Other Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic machine learning
blog
spam
natural language processing
corpus
botnet
Artificial Intelligence and Robotics
Information Security
Other Computer Sciences
description The internet has matured into the focal point of our era. Its ecosystem is vast, complex, and in many regards unaccounted for. One of the most prevalent aspects of the internet is spam. Like the rest of the internet, spam has evolved from simply meaning ‘unwanted emails’ to a blanket term that encompasses any unsolicited or illegitimate content that appears in the wide range of media that exists on the internet. Many forms of spam permeate the internet, and spam architects continue to develop tools and methods to avoid detection. On the other side, cyber security engineers continue to develop more sophisticated detection tools to curb the harmful effects that come with spam. This virtual arms race has no end in sight. Most efforts thus far have focused on accurately distinguishing spam from ham, and rightfully so, since initial detection is essential. However, research is lacking in understanding the current ecosystem of spam, spam campaigns, and the behavior of the botnets that drive the majority of spam traffic. This thesis focuses on characterizing spam, particularly the spam that appears in forums, where the spam is delivered by bots posing as legitimate users. Forum spam is used primarily to push advertisements or to boost other websites’ perceived popularity by including HTTP links in the content of the post. We conduct an experiment to collect a sample of the blog posts and network activity of the spambots that exist on the internet. We then present a corpus available for analysis and proceed with our own analysis. We cluster associated groups of users and IP addresses into entities, which we accept as a model of the underlying botnets that interact with our honeypots. We use Natural Language Processing (NLP) and Machine Learning (ML) to determine that creating semantic-based models of botnets is sufficient for distinguishing them from one another. We also find that the syntactic structure of posts has little variation from botnet to botnet. Finally, we confirm that to a large degree botnet behavior and content hold across different domains.
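note The abstract mentions clustering associated users and IP addresses into entities but does not say how. One plausible reading, offered here only as an illustrative sketch and not as the thesis's actual method, is to treat each observed (username, IP) pair as a link and take connected components with a union-find structure. The function name `cluster_entities`, the input shape, and the sample usernames and addresses are all assumptions made for this example.

```python
from collections import defaultdict


def cluster_entities(observations):
    """Group usernames and IP addresses into entities.

    observations: iterable of (username, ip) pairs, e.g. one per post
    seen by a honeypot (hypothetical input shape, not from the thesis).
    Returns a list of (set_of_usernames, set_of_ips) tuples, one per
    connected component of the user/IP co-occurrence graph.
    """
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        # Walk to the root, shortening the path as we go (path halving).
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag nodes so a username can never collide with an IP string.
    for user, ip in observations:
        union(("user", user), ("ip", ip))

    groups = defaultdict(lambda: (set(), set()))
    for node in list(parent):
        kind, value = node
        users, ips = groups[find(node)]
        (users if kind == "user" else ips).add(value)
    return list(groups.values())


if __name__ == "__main__":
    # Made-up usernames; addresses come from the documentation range.
    sample = [
        ("alice", "192.0.2.1"),
        ("bob", "192.0.2.1"),
        ("bob", "192.0.2.2"),
        ("carol", "192.0.2.9"),
    ]
    for users, ips in cluster_entities(sample):
        print(sorted(users), sorted(ips))
```

In this sketch, two usernames land in the same entity whenever a chain of shared IP addresses connects them. A real analysis would likely need extra safeguards, for example against shared proxies merging unrelated botnets.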