Automatic Identification and Production of Related Words for Historical Linguistics

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. First, we introduce a method to automatically determine whether two words...

Bibliographic Details
Main Authors:	Ciobanu, Alina Maria; Dinu, Liviu P.
Format:	Article
Language:	English
Published:	The MIT Press 2020-01-01
Series:	Computational Linguistics
Online Access:	https://www.mitpressjournals.org/doi/abs/10.1162/coli_a_00361

id	doaj-0e81237d25a045f298da1f8f7d71a810
record_format	Article
spelling	doaj-0e81237d25a045f298da1f8f7d71a810 2020-11-25T03:31:09Z eng The MIT Press Computational Linguistics 0891-2017 1530-9312 2020-01-01 45(4), 667-704 10.1162/coli_a_00361 Automatic Identification and Production of Related Words for Historical Linguistics Ciobanu, Alina Maria; Dinu, Liviu P.
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Ciobanu, Alina Maria; Dinu, Liviu P.
title	Automatic Identification and Production of Related Words for Historical Linguistics
publisher	The MIT Press
series	Computational Linguistics
issn	0891-2017 1530-9312
publishDate	2020-01-01
description	Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. First, we introduce a method to automatically determine whether two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a data set of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Second, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind. Third, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems, producing modern word forms and producing cognates. The task of reconstructing proto-words consists of recreating the words in an ancient language from its modern daughter languages. Having modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words. We apply our method to multiple data sets, showing that our approach improves on previous results, also having the advantage of requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
url	https://www.mitpressjournals.org/doi/abs/10.1162/coli_a_00361

Similar Items