Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase

Machine learning techniques are a standard approach in spam detection. Their quality depends on the quality of the learning set, and when the set is out of date, the quality of classification falls rapidly. The most popular public web spam dataset that can be used to train a spam detector—WEBSPAM-UK...

Bibliographic Details
Main Author: Marcin Luckner
Format: Article
Language: English
Published: Hindawi-Wiley 2019-01-01
Series: Security and Communication Networks
Online Access: http://dx.doi.org/10.1155/2019/6587020
id doaj-6bfa34ae41c34046bde7d17e717ff651
record_format Article
spelling doaj-6bfa34ae41c34046bde7d17e717ff651; 2020-11-25T01:21:29Z; eng; Hindawi-Wiley; Security and Communication Networks; 1939-0114; 1939-0122; 2019-01-01; 2019; 10.1155/2019/6587020; 6587020; Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase; Marcin Luckner (Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75 Street, 00-662 Warsaw, Poland); http://dx.doi.org/10.1155/2019/6587020
collection DOAJ
language English
format Article
sources DOAJ
author Marcin Luckner
author_facet Marcin Luckner
author_sort Marcin Luckner
title Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase
publisher Hindawi-Wiley
series Security and Communication Networks
issn 1939-0114
1939-0122
publishDate 2019-01-01
description Machine learning techniques are a standard approach in spam detection. Their quality depends on the quality of the learning set, and when the set is out of date, classification quality falls rapidly. The most popular public web spam dataset that can be used to train a spam detector, WEBSPAM-UK2007, is over ten years old. Therefore, there is a place for a lifelong machine learning system that can replace detectors based on a static learning set. In this paper, we propose a novel web spam recognition system. The system automatically rebuilds the learning set, using external data from spam traps and popular web services, to avoid classification based on outdated data. Using built-in automatic selection of the active classifier, the system very quickly attains productive accuracy despite a limited learning set. A test on real data from Quora, Reddit, and Stack Overflow showed high recognition quality: the average accuracy and the F-measure were both 0.98 in semiautomatic mode and both 0.96 in fully automatic mode.
url http://dx.doi.org/10.1155/2019/6587020