Authorship attribution in the e-mail domain: a study of the effect of size of author corpus and topic on accuracy of identification

Bibliographic Details
Main Author: Levy-Minzie, Kori.
Other Authors: Martell, Craig
Published: Monterey, California. Naval Postgraduate School 2012
Online Access: http://hdl.handle.net/10945/5780
Description
Summary: Approved for public release; distribution is unlimited. We determined that it is possible to achieve authorship attribution in the e-mail domain when training on "personal" e-mails and testing on "work" e-mails, and vice versa. These results are unique because they simulate two different e-mail addresses belonging to the same person, where the topics of the e-mails from the two addresses do not intersect. As we used only one classification technique, these results are preliminary and may serve as a baseline for future work in this area. The data comprised the entire Enron corpus as well as a hand-annotated subset of work and personal e-mails. We discovered that there is enough author signal in each class to identify an author in a sea of noise. We included suggestions for future work in the areas of expanding feature selection, increasing corpus size, and including more classification methods. Advancement in this area will contribute to increasing cyber security by identifying the senders of anonymous derogatory e-mails and reducing cyber bullying.