A Design of E-mail Classification Method based on Text Mining and Machine Learning
Master's === National Defense University, Chung Cheng Institute of Technology === Master's Program in Computer Science and Information Engineering === 100 === In recent years, computers and the Internet have made many tasks easier, but that convenience has also created problems. For example, individuals and companies must spend considerable time and money dealing with spam. Moreo...
Main Authors: | Lin, YouRen (林祐任) |
---|---|
Other Authors: | Yang, ChyiBao (楊棋堡) |
Format: | Others |
Language: | zh-TW |
Published: | 2012 |
Online Access: | http://ndltd.ncl.edu.tw/handle/72579165234356962198 |
id |
ndltd-TW-100CCIT0394020 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-100CCIT0394020 2015-10-13T21:01:52Z http://ndltd.ncl.edu.tw/handle/72579165234356962198 A Design of E-mail Classification Method based on Text Mining and Machine Learning 以文字探勘及機器學習為基礎之電子郵件分類方法設計 Lin, YouRen 林祐任 Master's National Defense University, Chung Cheng Institute of Technology Master's Program in Computer Science and Information Engineering 100 Yang, ChyiBao 楊棋堡 2012 thesis 51 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Defense University, Chung Cheng Institute of Technology === Master's Program in Computer Science and Information Engineering === 100 === In recent years, computers and the Internet have made many tasks easier, but that convenience has also created problems. For example, individuals and companies must spend considerable time and money dealing with spam. Moreover, spam may infect a computer with a Trojan horse or other malware, leaking personal data or turning the machine into a zombie computer. Data mining techniques can classify large volumes of data to discover useful information, and text mining can uncover unknown, hidden, and useful information in unstructured or semi-structured text. This study therefore combines data mining and text mining techniques to detect and classify likely spam in large volumes of e-mail.
This study proposes a new feature selection method, the TF-PDF (Term Frequency - Proportion Document Frequency) algorithm, which extracts features useful for classification from the subjects and contents of e-mails. The features are then used to train three classifiers (decision tree, Naïve Bayes, and support vector machine) that can effectively separate legitimate mail from spam. The study focuses on the performance of feature selection in text mining. The experiments use the Ling-Spam dataset. Using ten-fold cross-validation, the study compares the classification accuracies of the decision tree, Naïve Bayes, and support vector machine classifiers when trained on features selected by the TF-IDF, TF-G/G+L, and TF-PDF algorithms. The results show that models built on TF-PDF features achieve higher overall accuracy than models built on features from the other two feature selection algorithms.
|
author2 |
Yang, ChyiBao |
author_facet |
Yang, ChyiBao Lin, YouRen 林祐任 |
author |
Lin, YouRen 林祐任 |
author_sort |
Lin, YouRen |
title |
A Design of E-mail Classification Method based on Text Mining and Machine Learning |
publishDate |
2012 |
url |
http://ndltd.ncl.edu.tw/handle/72579165234356962198 |
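Illustrative note on the abstract: the record does not give the TF-PDF formula, so the sketch below covers only the two generic pieces the abstract names, the TF-IDF baseline and ten-fold cross-validation. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the summed-weight ranking rule, the round-robin fold assignment, and the toy messages are all hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return {term: summed TF-IDF weight} for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            # Textbook weighting: raw term frequency times log(N / df).
            scores[term] += count * math.log(n_docs / df[term])
    return dict(scores)


def top_k_features(docs, k):
    """Pick the k terms with the highest summed weight (assumed ranking rule)."""
    scores = tfidf_scores(docs)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    # Round-robin assignment of item indices to folds (an assumption;
    # real experiments usually shuffle or stratify by class first).
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Toy, made-up messages (not taken from the Ling-Spam corpus).
toy_docs = [
    "free money win prize now".split(),
    "win free prize claim now".split(),
    "meeting agenda for linguistics workshop".split(),
    "call for papers linguistics conference".split(),
]
selected = top_k_features(toy_docs, k=3)
splits = list(k_fold_indices(20, k=10))
```

In a full experiment each `(train, test)` pair would be used to fit a classifier on the selected features and score it on the held-out fold, with the ten accuracies averaged; a term that occurs in every document receives weight zero under this weighting and is never preferred.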