Languages in China provinces: quantitative estimation with incomplete data

This paper formulates and solves a practical data-recovery problem concerning the distribution of languages at the regional level in China. The recovery is needed to compute linguistic diversity indices, which, in turn, are used to analyze...


Bibliographic Details
Main Authors: Denis V. Davydov, Aleksandr B. Shapoval, A. I. Yamilov
Format: Article
Language: Russian
Published: Institute of Computer Science 2016-08-01
Series: Компьютерные исследования и моделирование
Subjects: regional languages usage; dissimilarity indices; incomplete data identification
Online Access: http://crm.ics.org.ru/uploads/crmissues/crm_2016_4/16.08.09.pdf
id doaj-d39056d27cfc4290accdae0e1bc7baf7
record_format Article
spelling doaj-d39056d27cfc4290accdae0e1bc7baf7; 2020-11-25T01:39:02Z; rus; Institute of Computer Science; Компьютерные исследования и моделирование; ISSN 2076-7633, 2077-6853; 2016-08-01; vol. 8, no. 4, pp. 707-716; DOI 10.20537/2076-7633-2016-8-4-707-716; 2484
collection DOAJ
language Russian
format Article
sources DOAJ
author Denis V. Davydov
Aleksandr B. Shapoval
A. I. Yamilov
title Languages in China provinces: quantitative estimation with incomplete data
publisher Institute of Computer Science
series Компьютерные исследования и моделирование
issn 2076-7633
2077-6853
publishDate 2016-08-01
description This paper formulates and solves a practical data-recovery problem concerning the distribution of languages at the regional level in China. The recovery is needed to compute linguistic diversity indices, which, in turn, are used to analyze empirically and to predict sources of social and economic development, as well as to indicate potential conflicts at the regional level. We use the Ethnologue database and the China census as the initial data sources. For every language spoken in China, the data contain (a) an estimate of the number of China residents who claim this language as their mother tongue, and (b) indicators of the presence of such residents in the provinces of China. For each language/province pair, we aim to estimate the number of inhabitants of the province who claim the language as their mother tongue. This base problem reduces to solving an underdetermined system of algebraic equations. An additional complication is that the Ethnologue database combines data collected at different times, owing to gaps between Ethnologue language surveys and the cost of data collection. We relate these data to a single point in time, which turns the initial task into an ill-posed system of algebraic equations with an imprecisely determined right-hand side. Therefore, we look for an approximate solution that minimizes the discrepancy of the system. Since some languages are far less widespread than others, we minimize a weighted discrepancy, with weights inverse to the right-hand-side elements of the equations. This definition of the discrepancy makes it possible to recover the required variables. More than 92% of the recovered variables are robust to a probabilistic modelling procedure for potential errors in the initial data.
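The weighted-discrepancy idea in the description can be illustrated with a small sketch. This is not the authors' implementation. It assumes, hypothetically, that the equations are "speakers of a language summed over provinces equal the language total" and "speakers in a province summed over languages equal the province population", that the discrepancy is the sum of squared relative residuals, and that it is minimized by projected gradient descent under a non-negativity constraint. The function names and the toy numbers are invented for illustration.

```python
# Illustrative sketch only: equation structure, squared-relative-residual
# objective, solver, and all data below are assumptions, not from the paper.

def build_equations(language_totals, province_totals, pairs):
    """Return a list of (unknown_indices, right_hand_side) equations."""
    equations = []
    for lang, total in language_totals.items():
        idx = [k for k, (l, _) in enumerate(pairs) if l == lang]
        equations.append((idx, float(total)))
    for prov, total in province_totals.items():
        idx = [k for k, (_, p) in enumerate(pairs) if p == prov]
        equations.append((idx, float(total)))
    return equations


def relative_residuals(equations, x):
    """Residual of each equation divided by its right-hand side."""
    return [(sum(x[k] for k in idx) - rhs) / rhs for idx, rhs in equations]


def recover_speakers(language_totals, province_totals, presence, iterations=5000):
    """Estimate speakers per (language, province) pair present in `presence`.

    Minimizes sum_i ((A x - b)_i / b_i)^2 subject to x >= 0.
    All right-hand sides must be strictly positive.
    """
    pairs = sorted(presence)
    equations = build_equations(language_totals, province_totals, pairs)
    # Upper bound on the gradient's Lipschitz constant:
    # twice the squared Frobenius norm of the row-weighted 0/1 matrix.
    lipschitz = 2.0 * sum(len(idx) / rhs ** 2 for idx, rhs in equations)
    step = 1.0 / lipschitz
    x = [0.0] * len(pairs)
    for _ in range(iterations):
        grad = [0.0] * len(pairs)
        for (idx, rhs), rel in zip(equations, relative_residuals(equations, x)):
            for k in idx:
                grad[k] += 2.0 * rel / rhs
        # Gradient step followed by projection onto x >= 0.
        x = [max(0.0, xk - step * gk) for xk, gk in zip(x, grad)]
    return dict(zip(pairs, x))


# Toy data (hypothetical): totals deliberately disagree (100 vs 102),
# mimicking an imprecisely determined right-hand side.
LANGUAGE_TOTALS = {"A": 60, "B": 30, "C": 10}
PROVINCE_TOTALS = {"P1": 56, "P2": 46}
PRESENCE = {("A", "P1"), ("A", "P2"), ("B", "P1"), ("B", "P2"), ("C", "P2")}

if __name__ == "__main__":
    estimate = recover_speakers(LANGUAGE_TOTALS, PROVINCE_TOTALS, PRESENCE)
    for pair, value in sorted(estimate.items()):
        print(pair, round(value, 2))
```

Dividing each residual by its right-hand side keeps a small language from being ignored next to a large one, which is the motivation the description gives for the weights. Because the toy system is underdetermined, the sketch returns only one of many minimizers. A robustness check like the one mentioned in the description would need to perturb the inputs and re-solve.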
topic regional languages usage
dissimilarity indices
incomplete data identification
url http://crm.ics.org.ru/uploads/crmissues/crm_2016_4/16.08.09.pdf