Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages

Natural language is a remarkable example of a complex dynamical system that combines variation with universal structure emerging from the interaction of millions of individuals. Understanding the statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the coexistence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models.

We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf's law characterized by two scaling regimes, whose parameters are remarkably robust with respect to the time as well as the type and the size of the database under consideration, depending only on the particular language. We interpret the two regimes as a distinction between a finite core vocabulary and a (virtually) infinite non-core vocabulary. Proposing a simple generative process of language usage, we establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with database size, and thereby obtain a unified perspective on the different universal scaling laws that appear simultaneously in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different words as measured in empirical data spanning hundreds of years and nine orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness. On the other hand, analysis of the variation around the expected vocabulary size reveals anomalous fluctuation scaling: the vocabulary is a non-self-averaging quantity, so fluctuations are much larger than expected. We derive how this results from topical variations in a collection of texts from different authors, disciplines, or times, which manifest themselves as correlations between the frequencies of semantically related words.

We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) that arise because word frequencies follow a fat-tailed distribution. First, we propose an information-theoretic measure, based on the Shannon-Gibbs entropy and suitable generalizations, that quantifies the similarity between different texts and allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study the change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that is available only at the macroscopic level (i.e. averaged over the population of speakers).
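The generalized (two-regime) Zipf law mentioned in the abstract is commonly written as a double power law in the rank-frequency plot. A sketch of one standard parameterization; the crossover rank b and tail exponent gamma are illustrative symbols, not values taken from the abstract:

    % Frequency F(r) of the word with rank r: classic Zipf behaviour
    % for the finite core vocabulary (r <= b), and a steeper power law
    % with exponent gamma > 1 for the unbounded non-core vocabulary.
    % The prefactor b^(gamma-1) makes the two branches match at r = b.
    F(r) \propto
    \begin{cases}
      r^{-1}, & r \le b \quad \text{(core vocabulary)}\\
      b^{\gamma-1}\, r^{-\gamma}, & r > b \quad \text{(non-core vocabulary)}
    \end{cases}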
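The connection between such a rank-frequency law and vocabulary growth can be illustrated with a minimal numerical sketch, assuming independent (Poisson) sampling of word tokens (a common null model, not necessarily the exact generative process of the thesis):

    import numpy as np

    def zipf_double_powerlaw(V, b=100, gamma=2.0):
        # Illustrative two-regime Zipf frequencies over V ranks:
        # F(r) ~ 1/r up to rank b, ~ r^(-gamma) beyond.
        r = np.arange(1, V + 1, dtype=float)
        f = np.where(r <= b, 1.0 / r, b ** (gamma - 1.0) / r ** gamma)
        return f / f.sum()

    def expected_vocabulary(p, M):
        # E[N(M)] = sum_r (1 - exp(-M * p_r)): the probability that the
        # word of rank r appears at least once among M word tokens.
        return np.sum(1.0 - np.exp(-M * p))

    p = zipf_double_powerlaw(V=10**6)
    for M in (10**3, 10**4, 10**5, 10**6):
        print(f"M = {M:>7d}  expected vocabulary ~ {expected_vocabulary(p, M):.0f}")

With these illustrative parameters the expected number of distinct words grows strongly sublinearly with the database size M, the Heaps'-law-type behaviour the abstract refers to.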
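The abstract's statement that the vocabulary is non-self-averaging can be made precise; a standard definition in terms of the relative variance (notation chosen here for illustration):

    % Relative variance of the vocabulary size N(M) at database size M.
    % Self-averaging: R(M) -> 0 as M -> infinity; anomalous fluctuation
    % scaling (non-self-averaging): R(M) stays of order one, i.e. the
    % uncertainty does not shrink relative to the mean.
    R(M) = \frac{\mathrm{Var}[N(M)]}{\langle N(M) \rangle^{2}}
    \xrightarrow{\;M \to \infty\;}
    \begin{cases}
      0, & \text{self-averaging}\\
      \mathrm{const} > 0, & \text{non-self-averaging}
    \end{cases}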
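The entropy-based similarity measure described in the final paragraph is in the spirit of the Jensen-Shannon divergence between word-frequency distributions; a minimal sketch of the plain Shannon case (the thesis also considers generalized entropies):

    from collections import Counter
    import numpy as np

    def shannon_entropy(p):
        p = p[p > 0]                      # convention: 0 * log 0 = 0
        return -np.sum(p * np.log2(p))

    def jensen_shannon_divergence(text_a, text_b):
        # Entropy of the mixed text minus the mean entropy of the two
        # texts; 0 for identical word statistics, 1 bit for disjoint ones.
        counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
        vocab = sorted(set(counts_a) | set(counts_b))
        p = np.array([counts_a[w] for w in vocab], dtype=float)
        q = np.array([counts_b[w] for w in vocab], dtype=float)
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))

    print(jensen_shannon_divergence("the cat sat on the mat",
                                    "the dog sat on the log"))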


Bibliographic Details
Main Author: Gerlach, Martin
Other Authors: Technische Universität Dresden, Fakultät Mathematik und Naturwissenschaften
Format: Doctoral Thesis
Language: English
Published: Saechsische Landesbibliothek - Staats- und Universitaetsbibliothek Dresden, 2016
Subjects: Complex Systems; Physics; Natural Language; Quantitative Linguistics; Data Analysis; ddc:530; rvk:SK 820
Online Access: http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-199083
http://www.qucosa.de/fileadmin/data/qucosa/documents/19908/Thesis_GerlachMartin.pdf
Referees: Prof. Dr. Jan-Michael Rost; Prof. Dr. Roland Ketzmerick; Prof. Alvaro Corral
Date: 2016-03-10