Orthographic measures of language distances between the official South African languages

Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive informati...


Bibliographic Details
Main Authors:	P.N. Zulu, G. Botha, E. Barnard
Format:	Article
Language:	Afrikaans
Published:	AOSIS 2008-07-01
Series:	Literator
Subjects:	Clustering Language Distances Language Identification Levenshtein Distance N-Gram
Online Access:	https://literator.org.za/index.php/literator/article/view/106

id	doaj-bc0d23ab50fe42508fc7541b58aa0df6
record_format	Article
spelling	doaj-bc0d23ab50fe42508fc7541b58aa0df6 | 2020-11-24T23:15:15Z | afr | AOSIS | Literator | ISSN 0258-2279, 2219-8237 | 2008-07-01 | vol. 29, no. 1, pp. 185-204 | DOI 10.4102/lit.v29i1.106 | 89 | P.N. Zulu, G. Botha, E. Barnard (all: Human Language Technologies Research Group, CSIR & Department of Electrical and Computer Engineering, University of Pretoria)
collection	DOAJ
language	Afrikaans
format	Article
sources	DOAJ
author	P.N. Zulu G. Botha E. Barnard
author_facet	P.N. Zulu G. Botha E. Barnard
author_sort	P.N. Zulu
title	Orthographic measures of language distances between the official South African languages
publisher	AOSIS
series	Literator
issn	0258-2279 2219-8237
publishDate	2008-07-01
description	Two methods for objectively measuring similarities and dissimilarities between the eleven official languages of South Africa are described. The first concerns the use of n-grams. The confusions between different languages in a text-based language identification system can be used to derive information on the relationships between the languages. Our classifier calculates n-gram statistics from text documents and then uses these statistics as features in classification. We show that the classification results of a validation test can be used as a similarity measure of the relationship between languages. Using the similarity measures, we were able to represent the relationships graphically. We also apply the Levenshtein distance measure to the orthographic word transcriptions from the eleven South African languages under investigation. Hierarchical clustering of the distances between the different languages shows the relationships between the languages in terms of regional groupings and closeness. Both multidimensional scaling and dendrogram analysis reveal results similar to well-known language groupings, and also suggest a finer level of detail on these relationships.
topic	Clustering Language Distances Language Identification Levenshtein Distance N-Gram
url	https://literator.org.za/index.php/literator/article/view/106
_version_	1725591441406689280
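
The description above mentions applying the Levenshtein distance to orthographic word transcriptions in order to compare languages. The following is a minimal sketch of that idea, not the authors' implementation: the record does not say how word pairs are aligned or whether distances are normalised, so the pairing by position, the normalisation by the longer word, and the function names `levenshtein` and `language_distance` are all assumptions made for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def language_distance(words_a: Sequence[str], words_b: Sequence[str]) -> float:
    """Mean length-normalised edit distance over word pairs aligned by position.

    Assumption (not from the record): words_a[i] and words_b[i] are
    transcriptions of the same concept in the two languages.
    """
    pairs = list(zip(words_a, words_b))
    if not pairs:
        raise ValueError("need at least one word pair")
    total = 0.0
    for word_a, word_b in pairs:
        longest = max(len(word_a), len(word_b))
        total += levenshtein(word_a, word_b) / longest if longest else 0.0
    return total / len(pairs)
```

Computing `language_distance` for every pair of languages would give a symmetric distance matrix, which could then be passed to a hierarchical clustering or multidimensional scaling routine to produce the kind of dendrogram and map the description refers to.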

