Entropy Rate Estimates for Natural Language—A New Extrapolation of Compressed Large-Scale Corpora

One of the fundamental questions about human language is whether its entropy rate is positive. The entropy rate measures the average amount of information communicated per unit time. The question about the entropy of language dates back to experiments by Shannon in 1951, but in 1990 Hilberg raised doubt regarding a correct interpretation of these experiments. This article provides an in-depth empirical analysis, using 20 corpora of up to 7.8 gigabytes across six languages (English, French, Russian, Korean, Chinese, and Japanese), to conclude that the entropy rate is positive. To obtain the estimates for data length tending to infinity, we use an extrapolation function given by an ansatz. Whereas some ansatzes were proposed previously, here we use a new stretched exponential extrapolation function that has a smaller error of fit. Thus, we conclude that the entropy rates of human languages are positive but approximately 20% smaller than without extrapolation. Although the entropy rate estimates depend on the script kind, the exponent of the ansatz function turns out to be constant across different languages and governs the complexity of natural language in general. In other words, in spite of typological differences, all languages seem equally hard to learn, which partly confirms Hilberg’s hypothesis.

Bibliographic Details
Main Authors: Ryosuke Takahira (Graduate School of Information Science and Electrical Engineering, Kyushu University, Fukuoka 819-0395, Japan), Kumiko Tanaka-Ishii (Research Center for Advanced Science and Technology, University of Tokyo, Tokyo 153-8904, Japan), Łukasz Dębowski (Institute of Computer Science, Polish Academy of Sciences, Warszawa 01-248, Poland)
Format: Article
Language: English
Published: MDPI AG, 2016-10-01
Series: Entropy, Vol. 18, Iss. 10, Article 364
ISSN: 1099-4300
DOI: 10.3390/e18100364
Subjects: entropy rate; universal compression; stretched exponential; language universals
Online Access: http://www.mdpi.com/1099-4300/18/10/364
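
The abstract describes estimating the entropy rate by compressing ever larger prefixes of a corpus and extrapolating the per-symbol compression rate to infinite data length with a stretched exponential ansatz. The following is a minimal sketch of that general procedure, assuming a hypothetical ansatz of the form r(n) = h * exp(A * n^(beta - 1)); the exact functional form, the parameter values, and the data below are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the paper's code): fit a stretched exponential
# ansatz to per-symbol compression rates r(n) measured at increasing text
# lengths n, then read off the extrapolated entropy rate h = lim r(n).
#
# Assumed ansatz (hypothetical form): r(n) = h * exp(A * n**(beta - 1)),
# which tends to h as n -> infinity when 0 < beta < 1.

import numpy as np
from scipy.optimize import curve_fit

def stretched_exp(n, h, A, beta):
    """Stretched exponential ansatz for the compression rate at length n."""
    return h * np.exp(A * n ** (beta - 1.0))

# Synthetic measurements standing in for real compression-rate estimates
# (bits per character) at exponentially spaced prefix lengths.
rng = np.random.default_rng(0)
lengths = np.logspace(3, 9, 13)                       # 10^3 ... 10^9 characters
true_h, true_A, true_beta = 1.2, 3.0, 0.88            # hypothetical parameters
rates = stretched_exp(lengths, true_h, true_A, true_beta)
rates *= 1.0 + 0.01 * rng.standard_normal(len(lengths))  # 1% measurement noise

# Least-squares fit of (h, A, beta); the fitted h is the extrapolated rate.
(h_hat, A_hat, beta_hat), _ = curve_fit(
    stretched_exp, lengths, rates, p0=[1.0, 1.0, 0.9],
    bounds=([0.0, 0.0, 0.0], [10.0, 100.0, 1.0]),
)

print(f"extrapolated entropy rate h ~ {h_hat:.3f} bits/char")
print(f"exponent beta ~ {beta_hat:.3f}")
print(f"rate at largest measured n: {rates[-1]:.3f} bits/char")
```

Because n^(beta - 1) vanishes as n grows when beta < 1, the fitted h acts as the infinite-length estimate; the abstract reports that such extrapolated rates come out roughly 20% below the rates measured on the finite corpora.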