Spam Filter Improvement Through Measurement

This work supports the thesis that sound quantitative evaluation for spam filters leads to substantial improvement in the classification of email. To this end, new laboratory testing methods and datasets are introduced, and evidence is presented that their adoption at Text REtrieval Conference (TREC...


Bibliographic Details
Main Author: Lynam, Thomas Richard
Language: en
Published: 2009
Subjects: evaluation methodology; spam filtering; spam corpora; spam fusion; Computer Science
Online Access: http://hdl.handle.net/10012/4344
id ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-4344
record_format oai_dc
spelling ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-4344
2013-10-04T04:09:07Z
Lynam, Thomas Richard
2009-04-27T18:24:57Z
2009-04-27T18:24:57Z
2009-04-27T18:24:57Z
2009
http://hdl.handle.net/10012/4344
This work supports the thesis that sound quantitative evaluation for spam filters leads to substantial improvement in the classification of email. To this end, new laboratory testing methods and datasets are introduced, and evidence is presented that their adoption at the Text REtrieval Conference (TREC) and elsewhere has led to an improvement in state-of-the-art spam filtering. While many of these improvements have been discovered by others, the best-performing method known at this time -- spam filter fusion -- was demonstrated by the author. This work describes four principal dimensions of spam filter evaluation methodology and spam filter improvement. An initial study investigates the application of twelve open-source filter configurations in a laboratory environment, using a stream of 50,000 messages captured from a single recipient over eight months. The study measures the impact of user feedback and on-line learning on filter performance using methodology and measures that were released to the research community as the TREC Spam Filter Evaluation Toolkit. The toolkit was used as the basis of the TREC Spam Track, which the author co-founded with Cormack. The Spam Track, in addition to evaluating a new application (email spam), addressed the issue of testing systems on both private and public data. While streams of private messages are most realistic, they are not easy to come by and cannot be shared with the research community as archival benchmarks. Using the toolkit, participant filters were evaluated on both, and the differences were found not to substantially confound evaluation; as a result, public corpora were validated as research tools. Over the course of TREC and similar evaluation efforts, a dozen or more archival benchmarks -- some private and some public -- have become available. The toolkit and methodology have spawned improvements in the state of the art every year since their deployment in 2005. In 2005, 2006, and 2007, the Spam Track yielded new best-performing systems based on sequential compression models, orthogonal sparse bigram features, logistic regression and support vector machines. Using the TREC participant filters, we develop and demonstrate methods for on-line filter fusion that outperform all other reported on-line personal spam filters.
en
evaluation methodology
spam filtering
spam corpora
spam fusion
Spam Filter Improvement Through Measurement
Thesis or Dissertation
School of Computer Science
Doctor of Philosophy
Computer Science
collection NDLTD
language en
sources NDLTD
topic evaluation methodology
spam filtering
spam corpora
spam fusion
Computer Science
description This work supports the thesis that sound quantitative evaluation for spam filters leads to substantial improvement in the classification of email. To this end, new laboratory testing methods and datasets are introduced, and evidence is presented that their adoption at the Text REtrieval Conference (TREC) and elsewhere has led to an improvement in state-of-the-art spam filtering. While many of these improvements have been discovered by others, the best-performing method known at this time -- spam filter fusion -- was demonstrated by the author. This work describes four principal dimensions of spam filter evaluation methodology and spam filter improvement. An initial study investigates the application of twelve open-source filter configurations in a laboratory environment, using a stream of 50,000 messages captured from a single recipient over eight months. The study measures the impact of user feedback and on-line learning on filter performance using methodology and measures that were released to the research community as the TREC Spam Filter Evaluation Toolkit. The toolkit was used as the basis of the TREC Spam Track, which the author co-founded with Cormack. The Spam Track, in addition to evaluating a new application (email spam), addressed the issue of testing systems on both private and public data. While streams of private messages are most realistic, they are not easy to come by and cannot be shared with the research community as archival benchmarks. Using the toolkit, participant filters were evaluated on both, and the differences were found not to substantially confound evaluation; as a result, public corpora were validated as research tools. Over the course of TREC and similar evaluation efforts, a dozen or more archival benchmarks -- some private and some public -- have become available. The toolkit and methodology have spawned improvements in the state of the art every year since their deployment in 2005. In 2005, 2006, and 2007, the Spam Track yielded new best-performing systems based on sequential compression models, orthogonal sparse bigram features, logistic regression and support vector machines. Using the TREC participant filters, we develop and demonstrate methods for on-line filter fusion that outperform all other reported on-line personal spam filters.
author Lynam, Thomas Richard
title Spam Filter Improvement Through Measurement
publishDate 2009
url http://hdl.handle.net/10012/4344