A novel method of stylometry based on the statistic of numerals

A new method for the statistical analysis of texts is proposed. The frequency distribution of the first significant digits of numerals in English-language texts is considered. Both cardinal and ordinal numerals are taken into account, whether written in figures or in words. To identify the authors...
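
The counting step described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the tiny `WORD_DIGIT` lexicon, the `NUMERAL` pattern, and the function name `first_digits` are all hypothetical, and the sketch does not filter out idioms or set phrases (for example "second" used as a unit of time), which the record says must be removed first.

```python
import re

# Hypothetical, deliberately small lexicon: number word -> first significant digit.
# A real study would need a far fuller list plus manual filtering of idioms.
WORD_DIGIT = {
    "one": 1, "first": 1, "ten": 1, "tenth": 1, "eleven": 1, "twelve": 1,
    "hundred": 1, "thousand": 1,
    "two": 2, "second": 2, "twenty": 2,
    "three": 3, "third": 3, "thirty": 3,
    "four": 4, "fourth": 4, "forty": 4,
    "five": 5, "fifth": 5, "fifty": 5,
    "six": 6, "sixth": 6, "sixty": 6,
    "seven": 7, "seventh": 7, "seventy": 7,
    "eight": 8, "eighth": 8, "eighty": 8,
    "nine": 9, "ninth": 9, "ninety": 9,
}

_WORDS = "|".join(sorted(WORD_DIGIT, key=len, reverse=True))

# Either a run of number words joined by spaces or hyphens (only the leading
# word is captured, since it fixes the first digit), or a numeral in figures.
NUMERAL = re.compile(
    rf"\b(?P<word>{_WORDS})\b(?:[ -](?:{_WORDS})\b)*|(?P<fig>\d[\d,.]*)",
    re.IGNORECASE,
)


def first_digits(text):
    """Return the first significant digit of every numeral found in text."""
    digits = []
    for match in NUMERAL.finditer(text):
        if match.group("word"):
            digits.append(WORD_DIGIT[match.group("word").lower()])
        else:
            # Skip leading zeros and separators, e.g. "0.05" yields 5.
            lead = next((c for c in match.group("fig") if c in "123456789"), None)
            if lead is not None:
                digits.append(int(lead))
    return digits
```

Passing the returned list to `collections.Counter` gives the per-digit frequencies that the method compares across texts.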

Bibliographic Details
Main Author: Andrei Viacheslavovich Zenkov
Format: Article
Language: Russian
Published: Institute of Computer Science 2017-10-01
Series: Компьютерные исследования и моделирование (Computer Research and Modeling)
Subjects: text attribution; first significant digit of numerals
Online Access: http://crm.ics.org.ru/uploads/crmissues/crm_2017_5/2017_05_12.pdf
id doaj-d2610508c4914af18e53adb4215ee4d4
record_format Article
spelling doaj-d2610508c4914af18e53adb4215ee4d4 | 2020-11-24T20:53:43Z | rus | Institute of Computer Science | Компьютерные исследования и моделирование | 2076-7633 | 2077-6853 | 2017-10-01 | 9 | 5 | 837 | 850 | 10.20537/2076-7633-2017-9-5-837-850 | 2627 | A novel method of stylometry based on the statistic of numerals | Andrei Viacheslavovich Zenkov | http://crm.ics.org.ru/uploads/crmissues/crm_2017_5/2017_05_12.pdf | text attribution | first significant digit of numerals
collection DOAJ
language Russian
format Article
sources DOAJ
author Andrei Viacheslavovich Zenkov
title A novel method of stylometry based on the statistic of numerals
publisher Institute of Computer Science
series Компьютерные исследования и моделирование
issn 2076-7633, 2077-6853
publishDate 2017-10-01
description A new method for the statistical analysis of texts is proposed. The frequency distribution of the first significant digits of numerals in English-language texts is considered. Both cardinal and ordinal numerals are taken into account, whether written in figures or in words. To isolate the author's own use of numerals, we first removed from the text all idiomatic expressions and set phrases that merely happen to contain numerals, as well as itemizations, page numbers, and the like. Benford's law is found to hold approximately for the first-significant-digit frequencies in composite corpora of literary texts by different authors; a marked predominance of the digit 1 is observed. In coherent texts by a single author, characteristic deviations from Benford's law arise. These deviations are statistically stable and significant authorial peculiarities that, under certain conditions, make it possible to address questions of authorship and to distinguish between texts by different authors. The text should be large enough (at least about 200 kB). Toward the end of the digit sequence $\{1, 2, \ldots, 9\}$, the frequency distribution is subject to strong fluctuations and is therefore unrepresentative for our purpose. A theoretical explanation of the observed empirical regularity is not attempted here; this does not, however, preclude applying the proposed methodology to text attribution. The approach and its conclusions are supported by computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, J. Joyce, the Brontë sisters, and J. Austen. Using the proposed technique, we examined the authorship of a text previously ascribed to L. F. Baum (the result agrees with that obtained by other means). We have shown that Harper Lee's "To Kill a Mockingbird" was written by her, whereas the initial draft, "Go Set a Watchman", seems to have been written in collaboration with Truman Capote. All results are confirmed by the parametric Pearson's chi-squared test as well as by the non-parametric Mann-Whitney U test and Kruskal-Wallis test.
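
As an illustration of the comparison with Benford's law mentioned in the description, the following Python sketch computes the Benford probabilities P(d) = log10(1 + 1/d) and a Pearson chi-squared statistic for a list of observed first digits. It is a sketch under stated assumptions, not the article's procedure: the function names are hypothetical, and the record does not say how the author set up the tests (for instance, whether the fluctuating high digits were pooled or dropped).

```python
import math
from collections import Counter

# Upper 5% critical value of the chi-squared distribution with 8 degrees
# of freedom (nine digit classes minus one).
CHI2_CRIT_8DF_05 = 15.507


def benford_expected():
    """Benford's law: P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def chi_squared_vs_benford(digits):
    """Pearson chi-squared statistic of observed first digits against Benford's law.

    Assumes `digits` is a non-empty list of integers in 1..9.
    """
    n = len(digits)
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )
```

A statistic above `CHI2_CRIT_8DF_05` would indicate a deviation from Benford's law at the 5% level, provided the sample is large enough for the chi-squared approximation to be reasonable.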
topic text attribution; first significant digit of numerals
url http://crm.ics.org.ru/uploads/crmissues/crm_2017_5/2017_05_12.pdf