Stylistics versus statistics : a corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails
This thesis empirically investigates how a corpus linguistic approach can address the main theoretical and methodological challenges facing the field of forensic authorship analysis. Linguists approach the problem of questioned authorship from the theoretical position that each person has their own...
Main Author: | Wright, David |
---|---|
Other Authors: | Johnson, Alison |
Published: | University of Leeds, 2014 |
Subjects: | 410.1 |
Online Access: | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.640624 |
id |
ndltd-bl.uk-oai-ethos.bl.uk-640624 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-bl.uk-oai-ethos.bl.uk-640624; 2017-10-04T03:34:56Z; Stylistics versus statistics : a corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails; Wright, David; Johnson, Alison; 2014; 410.1; University of Leeds; http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.640624; http://etheses.whiterose.ac.uk/8278/; Electronic Thesis or Dissertation |
collection |
NDLTD |
sources |
NDLTD |
topic |
410.1 |
description |
This thesis empirically investigates how a corpus linguistic approach can address the main theoretical and methodological challenges facing the field of forensic authorship analysis. Linguists approach the problem of questioned authorship from the theoretical position that each person has their own distinctive idiolect (Coulthard 2004: 431). However, the notion of idiolect has come under scrutiny in forensic linguistics over recent years for being too abstract to be of practical use (Grant 2010; Turell 2010). At the same time, two competing methodologies have developed in authorship analysis. On the one hand, there are qualitative stylistic approaches, and on the other there are statistical ‘stylometric’ techniques. This study uses a corpus of over 60,000 emails and 2.5 million words written by 176 employees of the former American company Enron to tackle these issues in the contexts of both authorship attribution (identifying authors using linguistic evidence) and author profiling (predicting authors’ social characteristics using linguistic evidence). Analyses reveal that even in shared communicative contexts, and when using very common lexical items, individual Enron employees produce distinctive collocation patterns and lexical co-selections. In turn, these idiolectal elements of linguistic output can be captured and quantified by word n-grams (strings of n words). An attribution experiment is performed using word n-grams to identify the authors of anonymised email samples. Results of the experiment are encouraging, and it is argued that the approach developed here offers a means by which stylistic and statistical techniques can complement each other. Finally, quantitative and qualitative analyses are combined in the sociolinguistic profiling of Enron employees by gender and occupation. Current author profiling research is exclusively statistical in nature. However, the findings here demonstrate that when statistical results are augmented by qualitative evidence, the complex relationship between language use and author identity can be more accurately observed. |
author2 |
Johnson, Alison |
author |
Wright, David |
title |
Stylistics versus statistics : a corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails |
publisher |
University of Leeds |
publishDate |
2014 |
url |
http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.640624 |