Automatic Identification and Production of Related Words for Historical Linguistics

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. First, we introduce a method to automatically determine whether two words...


Bibliographic Details
Main Authors: Ciobanu, Alina Maria; Dinu, Liviu P.
Format: Article
Language: English
Published: The MIT Press 2020-01-01
Series: Computational Linguistics
Online Access: https://www.mitpressjournals.org/doi/abs/10.1162/coli_a_00361
id doaj-0e81237d25a045f298da1f8f7d71a810
record_format Article
spelling doaj-0e81237d25a045f298da1f8f7d71a810; 2020-11-25T03:31:09Z; eng; The MIT Press; Computational Linguistics; 0891-2017; 1530-9312; 2020-01-01; 45; 4; 667; 704; 10.1162/coli_a_00361; Automatic Identification and Production of Related Words for Historical Linguistics; Ciobanu, Alina Maria; Dinu, Liviu P.
collection DOAJ
language English
format Article
sources DOAJ
author Ciobanu, Alina Maria
Dinu, Liviu P.
title Automatic Identification and Production of Related Words for Historical Linguistics
publisher The MIT Press
series Computational Linguistics
issn 0891-2017
1530-9312
publishDate 2020-01-01
description Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. First, we introduce a method to automatically determine whether two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a data set of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Second, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind. Third, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems, producing modern word forms and producing cognates. The task of reconstructing proto-words consists of recreating the words in an ancient language from its modern daughter languages. Having modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words. We apply our method to multiple data sets, showing that our approach improves on previous results, also having the advantage of requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
url https://www.mitpressjournals.org/doi/abs/10.1162/coli_a_00361
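
The description above mentions orthographic alignment and the use of aligned subsequences as classification features, but this record does not give the algorithm itself. As a rough illustration only, the sketch below assumes a Needleman-Wunsch global alignment with unit scores and takes the aligned bigram pairs that differ as features. The function names `align` and `mismatch_ngrams`, the scoring values, the bigram size, and the example word pair (Italian "notte", Spanish "noche") are assumptions of this sketch, not details taken from the article.

```python
from typing import List, Tuple

GAP = "-"  # symbol inserted where one word has no counterpart character


def align(a: str, b: str, match: int = 1, mismatch: int = -1,
          gap: int = -1) -> Tuple[str, str]:
    """Globally align two spellings (Needleman-Wunsch, assumed unit scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_ngrams(al_a: str, al_b: str, n: int = 2) -> List[str]:
    """Return aligned n-gram pairs that differ, written as 'source>target'."""
    feats = []
    for k in range(len(al_a) - n + 1):
        x, y = al_a[k:k + n], al_b[k:k + n]
        if x != y:
            feats.append(f"{x}>{y}")
    return feats


if __name__ == "__main__":
    al_it, al_es = align("notte", "noche")
    print(al_it, al_es)
    print(mismatch_ngrams(al_it, al_es))
```

Feature strings of this kind could be passed to any standard classifier as a bag of features. Which classifier and which feature windows the authors actually use is not stated in this record.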