Hash-based Histogram Compression Technique for E-Mail Spam Filtering

Bibliographic Details
Main Authors: Eric Joung, 莊斐然
Other Authors: Dr. Jian-Hua Yeh
Format: Others
Language: zh-TW
Published: 2010
Online Access: http://ndltd.ncl.edu.tw/handle/05575381161648024970
Description
Summary: Master's === Aletheia University === Master's Program, Department of Statistics and Actuarial Science === 98 === Since 1990, the Internet has changed the world, and spam has become a problem in people's lives. Many solutions have been proposed, but each has different strengths in different settings, and all leave room for improvement. For example, some solutions run very fast but are not precise enough, while others achieve good precision but take a long time to produce a result. In this paper, we use a simple technique to separate the terms of spam and non-spam files into two term sets that do not intersect, and generate a term list from the two sets. We then use the term list to build a term frequency matrix and compress it with a special hash method. Finally, we use the hashed matrix as input to an SVM to obtain the classification result. In our experiment, we use the TREC 2007 Spam Track as the data set, with 95% as the training set and 5% as the test set. The results show that although our method is not the best in precision, it is much faster than other methods at the same precision level. This suggests that our method could readily be used as a component in a real spam filtering system.
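
The summary does not say how the "special hash method" works, so the following is only a minimal sketch of the general pipeline under stated assumptions: whitespace tokenization, a CRC32-modulo bucket assignment standing in for the thesis's hash, and an arbitrary bucket count of 8. The function names (`tokenize`, `build_term_list`, `hashed_histogram`) and the toy messages are hypothetical and not taken from the thesis. The sketch drops terms shared by both classes, builds a term list from the two disjoint sets, and folds each message's term frequencies into a fixed-length histogram that could then be passed to an SVM.

```python
import zlib
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def build_term_list(spam_docs, ham_docs):
    # Collect the terms seen in each class.
    spam_terms = {t for d in spam_docs for t in tokenize(d)}
    ham_terms = {t for d in ham_docs for t in tokenize(d)}
    # Drop terms that occur in both classes so the two sets do not intersect.
    shared = spam_terms & ham_terms
    spam_only = spam_terms - shared
    ham_only = ham_terms - shared
    # The term list is the concatenation of the two disjoint sets.
    return sorted(spam_only) + sorted(ham_only)


def hashed_histogram(text, term_list, n_buckets=8):
    # Count only terms that survived the disjoint-set filtering.
    vocab = set(term_list)
    counts = Counter(t for t in tokenize(text) if t in vocab)
    # Assumption: CRC32 modulo n_buckets stands in for the thesis's hash.
    hist = [0] * n_buckets
    for term, freq in counts.items():
        bucket = zlib.crc32(term.encode("utf-8")) % n_buckets
        hist[bucket] += freq
    return hist


# Toy data, purely illustrative.
spam = ["cheap pills buy now", "buy cheap watches now"]
ham = ["meeting agenda for now", "project agenda attached"]

terms = build_term_list(spam, ham)
vec = hashed_histogram("buy cheap pills for the meeting", terms)
```

Each message becomes a vector of `n_buckets` values regardless of vocabulary size, which is what would make the downstream classifier cheaper to train; the trade-off is that distinct terms can collide in the same bucket.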