Best way for collecting data for low-resourced languages
Low resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language cor...
Main Author: | Karim, Hiva |
---|---|
Format: | Others |
Language: | English |
Published: | Högskolan Dalarna, Mikrodataanalys, 2020 |
Subjects: | Computer and Information Sciences; Data- och informationsvetenskap |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945 |
id |
ndltd-UPSALLA1-oai-DiVA.org-du-35945 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-du-35945; 2021-02-04T05:28:04Z; Best way for collecting data for low-resourced languages; eng; Karim, Hiva; Högskolan Dalarna, Mikrodataanalys; 2020; Computer and Information Sciences; Data- och informationsvetenskap; Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Computer and Information Sciences Data- och informationsvetenskap |
description |
Low resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language corpus, which is a major obstacle for experts trying to provide a comprehensive text processing system. In this study, I identified the best practices for producing and collecting data for such zero/low resource languages by means of crowd-sourcing. For the purpose of this study, a number of research articles (n=260) were extracted from Google Scholar, Microsoft Academic, and ScienceDirect. Of these articles, only 60 met the inclusion criteria and were considered for eligibility review. The full-text versions of these research articles were downloaded and then carefully screened to ensure eligibility. As a result of the eligibility assessment of the 60 potentially eligible full-text articles, only 25 were selected and qualified for inclusion in the final review. From the final pool of selected articles concerning data generation and collection practices for low resource languages, it can be concluded that speech-based audio data is one of the most common and accessible data types. It can be contended that collecting audio data from speech-based resources, such as native speakers of the intended language and available audio recordings, by taking advantage of new technologies, is the most practical, cost-effective, and common method for collecting data for low resource languages. |
author |
Karim, Hiva |
title |
Best way for collecting data for low-resourced languages |
publisher |
Högskolan Dalarna, Mikrodataanalys |
publishDate |
2020 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945 |