Summary: | Author Attribution (AA) is a critical stylometry problem: deducing the identity of the authors of electronic texts (e-texts) by examining only the texts themselves. AA is important to application domains such as recommender systems and forensics. Nevertheless, existing AA techniques have not been assessed on Emirati social media e-texts, because no suitable dataset exists for evaluating them in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), together with detailed evaluations. We introduce compound grams, a new definition of grams, which allow us to achieve higher classification accuracy. We also find that, with a suitable data representation, classification accuracy degrades less severely than previously reported as the number of suspect authors increases. Furthermore, to help address the lack of conveniently available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, Fextractor, with a highly intuitive API. The library generalizes all existing n-gram-based feature extraction methods under the notion of at least l-frequent, dir-directed, k-skipped n-grams, and allows grams to be defined in diverse ways, including definitions based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.
|
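To make the "at least l-frequent, k-skipped n-grams" idea concrete, the following is a minimal Python sketch under stated assumptions. It is not Fextractor's API: the function names `skip_grams` and `frequent_skip_grams` are hypothetical, the skip semantics follow the common reading that an n-gram may skip at most k tokens in total, and the dir (direction) parameter is left out because the summary does not define it.

```python
from collections import Counter
from itertools import combinations


def skip_grams(tokens, n, k):
    """Yield n-grams whose items skip at most k tokens in total.

    Assumption: "k-skipped" means the n items lie within a window of
    n + k consecutive positions. With k = 0 this yields ordinary n-grams.
    """
    for start in range(len(tokens) - n + 1):
        window_end = min(start + n + k, len(tokens))
        # Fix the first item at `start`; pick the other n - 1 items
        # from the positions that follow it inside the window.
        for rest in combinations(range(start + 1, window_end), n - 1):
            yield (tokens[start],) + tuple(tokens[i] for i in rest)


def frequent_skip_grams(tokens, n, k, l):
    """Count k-skipped n-grams and keep those seen at least l times."""
    counts = Counter(skip_grams(tokens, n, k))
    return {gram: c for gram, c in counts.items() if c >= l}


if __name__ == "__main__":
    words = "the cat the cat sat".split()
    print(frequent_skip_grams(words, n=2, k=0, l=2))
```

Here the grams are word tokens, but the same functions work on any sequence, for example a list of POS tags or word shapes, which is one way to picture the diverse gram definitions the summary mentions.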