A Design of E-mail Classification Method based on Text Mining and Machine Learning

Master's === 國防大學理工學院 (Chung Cheng Institute of Technology, National Defense University) === 資訊工程碩士班 (Master's Program in Computer Science and Information Engineering) === 100 === In recent years, computers and the Internet have made many tasks possible, but their convenience has also brought new problems. For example, individuals and companies must spend considerable time and money dealing with spam. Moreo...


Bibliographic Details
Main Authors: Lin, YouRen, 林祐任
Other Authors: Yang, ChyiBao
Format: Others
Language: zh-TW
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/72579165234356962198
id ndltd-TW-100CCIT0394020
record_format oai_dc
spelling ndltd-TW-100CCIT0394020 2015-10-13T21:01:52Z http://ndltd.ncl.edu.tw/handle/72579165234356962198 A Design of E-mail Classification Method based on Text Mining and Machine Learning 以文字探勘及機器學習為基礎之電子郵件分類方法設計 Lin, YouRen 林祐任 Master's 國防大學理工學院 資訊工程碩士班 100 Yang, ChyiBao 楊棋堡 2012 學位論文 ; thesis 51 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === 國防大學理工學院 (Chung Cheng Institute of Technology, National Defense University) === 資訊工程碩士班 (Master's Program in Computer Science and Information Engineering) === 100 === In recent years, computers and the Internet have made many tasks possible, but their convenience has also brought new problems. For example, individuals and companies must spend considerable time and money dealing with spam. Moreover, spam may infect a computer with a Trojan or other malware, leaking personal information or turning the machine into a zombie computer. Data mining techniques can classify large volumes of data to discover useful information, and text mining techniques can find unknown, hidden, and useful information in unstructured or semi-structured text. This study therefore combines data mining and text mining techniques to detect and classify likely spam in large volumes of e-mail. It proposes a new feature selection method, the TF-PDF (Term Frequency - Proportion Document Frequency) algorithm, which extracts features useful for classification from the subjects and bodies of e-mails. The features are then used to train three classifiers, a decision tree, Naïve Bayes, and a Support Vector Machine, so that legitimate mail and spam can be classified effectively. The study focuses on the performance of feature selection in text mining. Experiments use the Ling Spam dataset. Under ten-fold cross-validation, the classification accuracies of the three classifiers are compared across features selected by the TF-IDF, TF-G/G+L, and TF-PDF algorithms. The experimental results show that models built on TF-PDF features achieve higher overall accuracy than models built on features from the other two feature selection algorithms.
author2 Yang, ChyiBao
author_facet Yang, ChyiBao
Lin, YouRen
林祐任
author Lin, YouRen
林祐任
spellingShingle Lin, YouRen
林祐任
A Design of E-mail Classification Method based on Text Mining and Machine Learning
author_sort Lin, YouRen
title A Design of E-mail Classification Method based on Text Mining and Machine Learning
publishDate 2012
url http://ndltd.ncl.edu.tw/handle/72579165234356962198
_version_ 1718053145899171840
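
The description above names TF-IDF as one of the baseline feature selection algorithms and ten-fold cross-validation as the evaluation protocol. The following is a minimal Python sketch of those two steps only, not the thesis's implementation. The record does not give the TF-PDF formula, so it is not reproduced here. The function names (`tfidf_scores`, `top_features`, `k_fold_indices`), the unsmoothed weight tf * log(N / df), and the round-robin fold assignment are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf weight} dict per tokenized document.

    Assumed formulation: (term count / document length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def top_features(docs, k):
    """Rank terms by their summed tf-idf weight and keep the k best."""
    totals = Counter()
    for doc_scores in tfidf_scores(docs):
        totals.update(doc_scores)  # adds the weights term by term
    return [term for term, _ in totals.most_common(k)]


def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k disjoint test folds (round-robin)."""
    return [list(range(i, n_items, k)) for i in range(k)]


if __name__ == "__main__":
    # Hypothetical toy messages, already tokenized.
    mails = [
        ["free", "money", "now"],
        ["meeting", "notes", "now"],
        ["free", "free", "offer"],
    ]
    print(top_features(mails, 3))
    print(k_fold_indices(25, 10)[:2])
```

In a fuller experiment, each fold would serve once as the test set while a classifier is trained on the remaining nine, and the ten accuracies would be averaged. A stratified split that preserves the spam-to-legitimate ratio in every fold is a common refinement; the record does not say whether the thesis used one.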