Best way for collecting data for low-resourced languages

Low-resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low-resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language cor...


Bibliographic Details
Main Author: Karim, Hiva
Format: Others
Language: English
Published: Högskolan Dalarna, Mikrodataanalys 2020
Subjects:
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945
id ndltd-UPSALLA1-oai-DiVA.org-du-35945
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-du-35945 2021-02-04T05:28:04Z Best way for collecting data for low-resourced languages eng Karim, Hiva Högskolan Dalarna, Mikrodataanalys 2020 Computer and Information Sciences Data- och informationsvetenskap Student thesis info:eu-repo/semantics/bachelorThesis text http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945 application/pdf info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Computer and Information Sciences
Data- och informationsvetenskap
spellingShingle Computer and Information Sciences
Data- och informationsvetenskap
Karim, Hiva
Best way for collecting data for low-resourced languages
description Low-resource languages possess a limited number of digitized texts, making it challenging to generate a satisfactory language audio corpus and information retrieval services. Low-resource languages, especially those spoken exclusively in African countries, lack a well-defined and annotated language corpus, which is a major obstacle for experts trying to provide a comprehensive text processing system. In this study, I identified the best practices for producing and collecting data for such zero/low-resource languages by means of crowdsourcing. For the purpose of this study, a number of research articles (n=260) were extracted from Google Scholar, Microsoft Academic, and ScienceDirect. Of these articles, only 60 met the inclusion criteria and were considered for eligibility review. Full-text versions of these articles were downloaded and carefully screened to ensure eligibility. As a result of the eligibility assessment of the 60 potentially eligible full-text articles, only 25 were selected and qualified for inclusion in the final review. From the final pool of selected articles, concerning data generation and collection practices for low-resource languages, it can be concluded that speech-based audio data is one of the most common and accessible data types. It can be contended that collecting audio data from speech-based resources, such as native speakers of the intended language and available audio recordings, by taking advantage of new technologies, is the most practical, cost-effective, and common method for collecting data for low-resource languages.
author Karim, Hiva
author_facet Karim, Hiva
author_sort Karim, Hiva
title Best way for collecting data for low-resourced languages
title_short Best way for collecting data for low-resourced languages
title_full Best way for collecting data for low-resourced languages
title_fullStr Best way for collecting data for low-resourced languages
title_full_unstemmed Best way for collecting data for low-resourced languages
title_sort best way for collecting data for low-resourced languages
publisher Högskolan Dalarna, Mikrodataanalys
publishDate 2020
url http://urn.kb.se/resolve?urn=urn:nbn:se:du-35945
work_keys_str_mv AT karimhiva bestwayforcollectingdataforlowresourcedlanguages
_version_ 1719375524587896832