Summary: | Master's === National Defense University, Chung Cheng Institute of Technology (國防大學理工學院) === Master's Program in Computer Science and Information Engineering === 100 === In recent years, computers and the Internet have made many tasks easier, but that convenience has also brought problems. For example, individuals and companies must spend considerable time and money dealing with spam. Moreover, spam may infect a computer with a Trojan or other malware, which can leak personal data or turn the machine into a zombie computer. Data mining techniques can classify large volumes of data to discover useful information, and text mining techniques can find unknown, hidden, and useful information in unstructured or semi-structured text. This study therefore combines data mining and text mining techniques to detect and classify possible spam in large volumes of e-mail.
This study proposes a new feature selection method, the TF-PDF (Term Frequency - Proportion Document Frequency) algorithm, which extracts features useful for classification from the subjects and contents of e-mails. The features are then used to train three classifiers, Decision Tree, Naïve Bayes, and Support Vector Machine, so that legitimate mail and spam can be classified effectively. The study focuses on the performance of feature selection in text mining. Experiments use the Ling Spam dataset. Using ten-fold cross-validation, the classification accuracies of the Decision Tree, Naïve Bayes, and Support Vector Machine classifiers are compared across the TF-IDF, TF-G/G+L, and TF-PDF feature selection algorithms. The experimental results show that models built on features selected by TF-PDF achieve higher overall accuracy than models built on features selected by the other two algorithms.
|
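The evaluation procedure described in the summary (select features, train a classifier, score it with ten-fold cross-validation) can be illustrated with a minimal sketch. It rests on several assumptions that are not part of the study: the summary does not give the TF-PDF formula, so the sketch ranks terms with TF-IDF, one of the baselines it names; the classifier is a small hand-written multinomial Naïve Bayes with add-one smoothing; the corpus is a made-up toy set rather than Ling Spam; and all function names (`tfidf_top_terms`, `train_nb`, `predict_nb`, `cross_validate`) are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_top_terms(docs, k):
    """Rank terms by TF-IDF summed over the corpus and keep the top k."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = Counter()
    for toks in tokenized:
        for term, count in Counter(toks).items():
            scores[term] += (count / len(toks)) * math.log(len(docs) / df[term])
    return {term for term, _ in scores.most_common(k)}

def train_nb(docs, labels, vocab):
    """Count documents per class and selected-term occurrences per class."""
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for doc, label in zip(docs, labels):
        term_counts[label].update(t for t in tokenize(doc) if t in vocab)
    return class_docs, term_counts

def predict_nb(model, vocab, doc):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, term_counts = model
    total = sum(class_docs.values())
    best, best_lp = None, -math.inf
    for c in class_docs:
        lp = math.log(class_docs[c] / total)
        denom = sum(term_counts[c].values()) + len(vocab)
        for t in tokenize(doc):
            if t in vocab:
                lp += math.log((term_counts[c][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

def cross_validate(docs, labels, folds=10, k=50):
    """k-fold accuracy; features are re-selected on each training split."""
    correct = 0
    for f in range(folds):
        test_idx = set(range(f, len(docs), folds))
        train = [(d, y) for i, (d, y) in enumerate(zip(docs, labels))
                 if i not in test_idx]
        tr_docs, tr_labels = zip(*train)
        vocab = tfidf_top_terms(tr_docs, k)
        model = train_nb(tr_docs, tr_labels, vocab)
        correct += sum(predict_nb(model, vocab, docs[i]) == labels[i]
                       for i in test_idx)
    return correct / len(docs)

# Toy corpus (assumption): invented messages, not the Ling Spam dataset.
spam = ["win free money now", "free offer click now",
        "cheap money offer win", "click now free prize"]
ham = ["linguistics conference call papers", "syntax workshop papers deadline",
       "conference deadline syntax review", "workshop review linguistics call"]
docs, labels = [], []
for i in range(10):
    docs += [spam[i % len(spam)], ham[i % len(ham)]]
    labels += ["spam", "ham"]

print("ten-fold accuracy:", cross_validate(docs, labels))
```

Selecting features inside each training split, rather than once on the whole corpus, keeps the held-out messages from influencing which terms are chosen. Comparing feature selection methods in the way the summary describes would amount to swapping `tfidf_top_terms` for another ranking function while leaving the rest of the loop unchanged.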