Summary: | Master's === National Dong Hwa University === Master's Program in Information Management === 100 === E-mail is one of the most important and successful applications of the Internet, and it is widely used for electronic advertising. Because the marginal cost of copying and distributing an e-mail is close to zero, spam is causing increasing damage to the Internet ecosystem. Spammers use spam as a marketing tool, hoping to increase the visibility and revenue of their websites.
In this thesis, we propose an e-mail filtering mechanism based on re-learning and hyperlink analysis. The filter consists of four processes. First, we analyze e-mail formats and classify e-mails into a context category and a hyperlink category. Second, we design four modules to analyze the features of a hyperlink: a recursive link extraction module, a shortened-URL restoration module, a Unicode decoding module, and a feature extraction module. We then build an improved ID3 decision tree from the hyperlink features and assign a score to each rule. Third, we score each uncategorized e-mail according to the rules established in the second process and classify it as either spam or legitimate. Finally, we use the re-learning mechanism to reduce the false positive (FP) rate and to address the problem of time-sensitive spam keywords.
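The second process can be illustrated with a minimal Python sketch. It is not the thesis implementation: the feature names, the link-count threshold, and the `SHORTENER_HOSTS` list are hypothetical, shortened URLs are only flagged rather than restored (restoration would need network access), and plain ID3 information gain stands in for the improved variant, whose details the abstract does not give.

```python
import math
import re
from collections import Counter
from urllib.parse import unquote, urlsplit

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
IPV4_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

# Hypothetical list of URL-shortener hosts, for illustration only.
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com", "goo.gl"}


def extract_links(body):
    """Return the percent-decoded http(s) URLs found in an e-mail body."""
    return [unquote(url) for url in URL_RE.findall(body)]


def host_of(url):
    """Return the lower-cased host of a URL, or '' if it cannot be parsed."""
    try:
        return urlsplit(url).hostname or ""
    except ValueError:
        return ""


def link_features(body):
    """Map an e-mail body to boolean hyperlink features (assumed names)."""
    links = extract_links(body)
    hosts = [host_of(url) for url in links]
    return {
        "many_links": len(links) >= 3,  # threshold is an assumption
        "has_shortener": any(h in SHORTENER_HOSTS for h in hosts),
        "has_ip_host": any(IPV4_RE.fullmatch(h) for h in hosts),
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(rows, labels, feature):
    """Gain from splitting on a boolean feature, as used by plain ID3."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```

In a sketch like this, the feature with the highest information gain would become the next split in the tree, and each root-to-leaf path would correspond to one scored rule.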
In our experiments, the proposed method reached an overall accuracy of 98.70% and reduced the FP rate to below 1.13%. The results indicate that the method handles spam effectively and with high accuracy, especially for e-mails containing many hyperlinks.
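For reference, the two reported metrics are conventionally computed from a confusion matrix as sketched below. The abstract does not state its exact definitions, so the usual ones are assumed here, with a false positive meaning a legitimate e-mail classified as spam; the function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all e-mails that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def false_positive_rate(fp, tn):
    """Fraction of legitimate e-mails that were wrongly marked as spam."""
    return fp / (fp + tn)
```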
|