Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification

Filtering and automatically replying to emails is of interest to many, but it is hard because of the complexity of language and the dependence on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with Random F...


Bibliographic Details
Main Author: Halmann, Marju
Format: Others
Language: English
Published: Högskolan i Skövde, Institutionen för informationsteknologi 2017
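Subjects: Email mining; Latent Dirichlet Allocation; Random Forest classification; Computer Sciences; Datavetenskap (datalogi)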
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710
id ndltd-UPSALLA1-oai-DiVA.org-his-14710
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-his-14710; 2018-02-17T05:18:04Z; eng; Student thesis; info:eu-repo/semantics/bachelorThesis; text; application/pdf; info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Email mining
Latent Dirichlet Allocation
Random Forest classification
Computer Sciences
Datavetenskap (datalogi)
description Filtering and automatically replying to emails is of interest to many, but it is hard because of the complexity of language and the dependence on background information that is not present in the email itself. This paper investigates whether Latent Dirichlet Allocation (LDA) combined with a Random Forest classifier can be used for the more general email classification task and how it compares to other existing email classifiers. The comparison is based on a literature study and on empirical experimentation using two real-life datasets. First, a literature study is performed to gain insight into the accuracy of other available email classifiers. Second, the proposed model's accuracy is explored through experimentation. The literature study shows that the accuracy of more general email classifiers differs greatly across user sets. The proposed model's accuracy is within the reported accuracy range, but in its lower part, which indicates that the proposed model performs poorly compared to other classifiers. On average, the classifier's performance improves by 15 percentage points with additional information. This indicates that LDA combined with a Random Forest classifier is promising; however, future studies are needed to explore the model and ways to further increase its accuracy.
author Halmann, Marju
title Email Mining Classifier : The empirical study on combining the topic modelling with Random Forest classification
publisher Högskolan i Skövde, Institutionen för informationsteknologi
publishDate 2017
url http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-14710