Universality and variability in the statistics of data with fat-tailed distributions: the case of word frequencies in natural languages

Natural language is a remarkable example of a complex dynamical system that combines variation with universal structure emerging from the interaction of millions of individuals. Understanding the statistical properties of texts is not only crucial in applications of information retrieval and natural language processing, e.g. search engines, but also allows deeper insights into the organization of knowledge in the form of written text. In this thesis, we investigate the statistical and dynamical processes underlying the coexistence of universality and variability in word statistics. We combine a careful statistical analysis of large empirical databases on language usage with analytical and numerical studies of stochastic models.

We find that the fat-tailed distribution of word frequencies is best described by a generalized Zipf's law characterized by two scaling regimes, whose parameters are remarkably robust with respect to the time as well as the type and the size of the database under consideration, depending only on the particular language. We interpret the two regimes as a distinction between a finite core vocabulary and a (virtually) infinite non-core vocabulary. Proposing a simple generative process of language usage, we establish the connection to the problem of vocabulary growth, i.e. how the number of different words scales with database size, and thereby obtain a unified perspective on the different universal scaling laws that appear simultaneously in the statistics of natural language. On the one hand, our stochastic model accurately predicts the expected number of different words as measured in empirical data spanning hundreds of years and nine orders of magnitude in size, showing that the supposed vocabulary growth over time is mainly driven by database size and not by a change in vocabulary richness. On the other hand, analysis of the variation around the expected vocabulary size reveals anomalous fluctuation scaling: the vocabulary is a non-self-averaging quantity, so fluctuations are much larger than expected. We derive how this results from topical variations in a collection of texts from different authors, disciplines, or times, which manifest themselves as correlations between the frequencies of semantically related words.

We explore the consequences of topical variation in applications to language change and topic models, emphasizing the difficulties (and presenting possible solutions) that arise because word frequencies follow a fat-tailed distribution. First, we propose an information-theoretic measure, based on the Shannon-Gibbs entropy and suitable generalizations, that quantifies the similarity between different texts and allows us to determine how fast the vocabulary of a language changes over time. Second, we combine topic models from machine learning with concepts from community detection in complex networks in order to infer large-scale (mesoscopic) structures in a collection of texts. Finally, we study the change of individual words on historical time scales, i.e. how a linguistic innovation spreads through a community of speakers, providing a framework to quantitatively combine microscopic models of language change with empirical data that is available only at the macroscopic level (i.e. averaged over the population of speakers).
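The generalized (two-regime) Zipf law mentioned in the abstract is commonly written as a double power law in the rank-frequency plot. A sketch of one standard parameterization; the crossover rank b and tail exponent gamma are illustrative symbols, not values taken from the abstract:

    % Frequency F(r) of the word with rank r: classic Zipf behaviour
    % for the finite core vocabulary (r <= b), and a steeper power law
    % with exponent gamma > 1 for the unbounded non-core vocabulary.
    % The prefactor b^(gamma-1) makes the two branches match at r = b.
    F(r) \propto
    \begin{cases}
      r^{-1}, & r \le b \quad \text{(core vocabulary)}\\
      b^{\gamma-1}\, r^{-\gamma}, & r > b \quad \text{(non-core vocabulary)}
    \end{cases}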
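The connection between such a rank-frequency law and vocabulary growth can be illustrated with a minimal numerical sketch, assuming independent (Poisson) sampling of word tokens (a common null model, not necessarily the exact generative process of the thesis):

    import numpy as np

    def zipf_double_powerlaw(V, b=100, gamma=2.0):
        # Illustrative two-regime Zipf frequencies over V ranks:
        # F(r) ~ 1/r up to rank b, ~ r^(-gamma) beyond.
        r = np.arange(1, V + 1, dtype=float)
        f = np.where(r <= b, 1.0 / r, b ** (gamma - 1.0) / r ** gamma)
        return f / f.sum()

    def expected_vocabulary(p, M):
        # E[N(M)] = sum_r (1 - exp(-M * p_r)): the probability that the
        # word of rank r appears at least once among M word tokens.
        return np.sum(1.0 - np.exp(-M * p))

    p = zipf_double_powerlaw(V=10**6)
    for M in (10**3, 10**4, 10**5, 10**6):
        print(f"M = {M:>7d}  expected vocabulary ~ {expected_vocabulary(p, M):.0f}")

With these illustrative parameters the expected number of distinct words grows strongly sublinearly with the database size M, the Heaps'-law-type behaviour the abstract refers to.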
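The abstract's statement that the vocabulary is non-self-averaging can be made precise; a standard definition in terms of the relative variance (notation chosen here for illustration):

    % Relative variance of the vocabulary size N(M) at database size M.
    % Self-averaging: R(M) -> 0 as M -> infinity; anomalous fluctuation
    % scaling (non-self-averaging): R(M) stays of order one, i.e. the
    % uncertainty does not shrink relative to the mean.
    R(M) = \frac{\mathrm{Var}[N(M)]}{\langle N(M) \rangle^{2}}
    \xrightarrow{\;M \to \infty\;}
    \begin{cases}
      0, & \text{self-averaging}\\
      \mathrm{const} > 0, & \text{non-self-averaging}
    \end{cases}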
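The entropy-based similarity measure described in the final paragraph is in the spirit of the Jensen-Shannon divergence between word-frequency distributions; a minimal sketch of the plain Shannon case (the thesis also considers generalized entropies):

    from collections import Counter
    import numpy as np

    def shannon_entropy(p):
        p = p[p > 0]                      # convention: 0 * log 0 = 0
        return -np.sum(p * np.log2(p))

    def jensen_shannon_divergence(text_a, text_b):
        # Entropy of the mixed text minus the mean entropy of the two
        # texts; 0 for identical word statistics, 1 bit for disjoint ones.
        counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
        vocab = sorted(set(counts_a) | set(counts_b))
        p = np.array([counts_a[w] for w in vocab], dtype=float)
        q = np.array([counts_b[w] for w in vocab], dtype=float)
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))

    print(jensen_shannon_divergence("the cat sat on the mat",
                                    "the dog sat on the log"))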


Bibliographic Details
Main Author: Gerlach, Martin
Other Authors: Technische Universität Dresden, Fakultät Mathematik und Naturwissenschaften
Format: Doctoral Thesis
Language: English
Published: Saechsische Landesbibliothek - Staats- und Universitaetsbibliothek Dresden, 2016
Subjects: Complex Systems; Physics; Natural Language; Quantitative Linguistics; Data Analysis; ddc:530; rvk:SK 820
Online Access: http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-199083
http://www.qucosa.de/fileadmin/data/qucosa/documents/19908/Thesis_GerlachMartin.pdf
Referees: Prof. Dr. Jan-Michael Rost; Prof. Dr. Roland Ketzmerick; Prof. Alvaro Corral
Date: 2016-03-10