Summary: | Master's === Ming Chuan University === Master's Program, Department of Information and Telecommunications Engineering === 94 === Support Vector Machine (SVM) and Naïve Bayes are well-known machine-learning algorithms for content-based spam filtering. Building on the fast classification offered by the SVM hyperplane and the flexible threshold setting of Bayes, this thesis proposes a two-tier filtering scheme that combines SVM with a new Naïve Bayes model for anti-spam. In the first tier, Information Gain is used to select the keywords that form the SVM training vectors. The thesis also defines four margins around the hyperplane and passes the test samples that fall within a margin to the second tier, where a Bayesian probability calculation decides each sample's category. The results indicate that every margin setting improves accuracy (Accur) by about 1% to 4%, most notably the Maximum Distance and Average Distance margins. In addition, the optimal model achieves a total accuracy above 97% on the Chinese and English data. Overall, the experiments verify the feasibility and contribution of the proposed two-tier filtering scheme and the new Naïve Bayes model.
|
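The two steps named in the summary, keyword selection by Information Gain and routing of near-hyperplane samples to a Bayesian second tier, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the convention that a positive SVM score means spam, the single symmetric `margin` value, and the 0.5 Bayes threshold are all assumptions made for illustration; the summary does not state how the four margins (including Maximum Distance and Average Distance) are computed, so the margin is simply passed in as a parameter.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, term):
    """Information Gain of the binary feature 'term occurs in document'.

    docs   : list of token sets, one per message (hypothetical input format)
    labels : list of class labels aligned with docs
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def two_tier_classify(svm_score, margin, bayes_spam_prob, threshold=0.5):
    """Two-tier decision rule (assumed form, not taken from the thesis).

    Tier 1: if the sample lies outside the margin band around the
            hyperplane, accept the SVM decision (assumed: score > 0 is spam).
    Tier 2: otherwise decide by the Bayesian spam probability and threshold.
    """
    if abs(svm_score) > margin:
        return "spam" if svm_score > 0 else "ham"
    return "spam" if bayes_spam_prob >= threshold else "ham"


if __name__ == "__main__":
    docs = [{"free", "offer"}, {"free", "win"},
            {"meeting", "notes"}, {"lunch", "notes"}]
    labels = ["spam", "spam", "ham", "ham"]
    # Rank candidate keywords by Information Gain, highest first.
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    print(ranked)
    # A sample close to the hyperplane is decided by the second tier.
    print(two_tier_classify(svm_score=0.2, margin=0.5, bayes_spam_prob=0.9))
```

In this sketch a wider `margin` sends more samples to the Bayesian tier, and the Bayes `threshold` can be tuned separately from the SVM, which is the flexibility the summary attributes to the second tier.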