alineR: an R Package for Optimizing Feature-Weighted Alignments and Linguistic Distances

Linguistic distance measurements are commonly used in anthropology and biology when quantitative and statistical comparisons between words are needed. This is common, for example, when analyzing linguistic and genetic data. Such comparisons can provide insight into historical population patterns and...


Bibliographic Details
Main Authors: Downey, Sean S., Sun, Guowei, Norquest, Peter
Other Authors: Univ Arizona, Dept Anthropol
Language: en
Published: R FOUNDATION STATISTICAL COMPUTING 2017
Online Access: http://hdl.handle.net/10150/625224
http://arizona.openrepository.com/arizona/handle/10150/625224
id ndltd-arizona.edu-oai-arizona.openrepository.com-10150-625224
record_format oai_dc
spelling 2017-08-12T03:00:34Z Downey, Sean S.; Sun, Guowei; Norquest, Peter. alineR: an R Package for Optimizing Feature-Weighted Alignments and Linguistic Distances. The R Journal (2017) 9:1, pages 138-152. ISSN 2073-4859. Article, 2017-08-10. https://journal.r-project.org/archive/2017/RJ-2017-005/index.html This article is licensed under a Creative Commons Attribution 4.0 International license.
collection NDLTD
language en
sources NDLTD
description Linguistic distance measurements are commonly used in anthropology and biology when quantitative and statistical comparisons between words are needed. This is common, for example, when analyzing linguistic and genetic data. Such comparisons can provide insight into historical population patterns and evolutionary processes. However, the most commonly used linguistic distances are derived from edit distances, which do not weight phonetic features that may, for example, represent smaller-scale patterns in linguistic evolution. Thus, computational methods for calculating feature-weighted linguistic distances are needed for linguistic, biological, and evolutionary applications; additionally, the linguistic distances presented here are generic and may have broader applications in fields such as text mining and search, as well as applications in psycholinguistics and morphology. To facilitate this research, we are making available an open-source R software package that performs feature-weighted linguistic distance calculations. The package also includes a supervised learning methodology that uses a genetic algorithm and manually determined alignments to estimate 13 linguistic parameters including feature weights and a skip penalty. Here we present the package and use it to demonstrate the supervised learning methodology by estimating the optimal linguistic parameters for both simulated data and for a sample of Austronesian languages. Our results show that the methodology can estimate these parameters for both simulated and real language data, that optimizing feature weights improves alignment accuracy by approximately 29%, and that optimization significantly affects the resulting distance measurements. Availability: alineR is available on CRAN.
author2 Univ Arizona, Dept Anthropol
author Downey, Sean S.
Sun, Guowei
Norquest, Peter
title alineR: an R Package for Optimizing Feature-Weighted Alignments and Linguistic Distances