Summary: | This thesis describes work on the detection of anomalous material in text without the use of training data. We use the term anomalous to refer to text that is irregular, or that deviates significantly from its surrounding context. We show that identifying such abnormalities in text can be viewed as a type of outlier detection, because these anomalies will differ significantly from the writing style in the majority of the text. We consider segments of text that are anomalous with respect to topic (about a different subject), author (written by a different person), or genre (written for a different audience or from a different source), and we experiment with whether it is possible to identify these anomalous segments automatically. Five different innovative approaches to this problem are introduced and assessed using many experiments over large document collections, created to contain randomly inserted anomalous segments. In order to identify anomalies in text successfully, we investigate and evaluate 166 stylistic and linguistic features used to characterize writing, some of which are well-established stylistic determiners, but many of which are original. Using these features with each of our methods, we examine the effect of segment size on our ability to detect anomalies, using segments of 100, 500, and 1000 words. We show substantial improvements over a baseline in all cases for all methods, identify a novel method that performs consistently better than the others, and identify the features that contribute most to unsupervised anomaly detection.
|