Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages

Most of the Web is published in languages that are not accessible to many potential users who are only able to read and understand their local languages. Many of these local languages are Resources Scarce Languages (RSLs) and lack the necessary resources, such as machine translation tools, to make a...


Bibliographic Details
Main Author: Chavula, Catherine
Other Authors: Suleman, Hussein
Format: Doctoral Thesis
Language: English
Published: Faculty of Science 2021
Subjects: Resources Scarce Languages; Bantu languages; Southeastern Africa
Online Access: http://hdl.handle.net/11427/33614
id ndltd-netd.ac.za-oai-union.ndltd.org-uct-oai-localhost-11427-33614
record_format oai_dc
spelling ndltd-netd.ac.za-oai-union.ndltd.org-uct-oai-localhost-11427-33614 2021-07-16T05:08:48Z Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages Chavula, Catherine Suleman, Hussein Resources Scarce Languages Bantu languages Southeastern Africa 2021-07-13T10:42:09Z 2021-07-13T10:42:09Z 2021_ 2021-07-13T10:40:53Z Doctoral Thesis Doctoral PhD http://hdl.handle.net/11427/33614 eng application/pdf Faculty of Science Department of Computer Science
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic Resources Scarce Languages
Bantu languages
Southeastern Africa
spellingShingle Resources Scarce Languages
Bantu languages
Southeastern Africa
Chavula, Catherine
Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages
description Most of the Web is published in languages that are not accessible to many potential users who are only able to read and understand their local languages. Many of these local languages are Resources Scarce Languages (RSLs) and lack the necessary resources, such as machine translation tools, to make available content more accessible. State-of-the-art preprocessing tools and retrieval methods are tailored for Web-dominant languages and, accordingly, documents written in RSLs are ranked low and are difficult to find in search results, resulting in a difficult and frustrating search experience for speakers of RSLs. In this thesis, we propose the use of language similarities to match, re-rank and return search results written in closely related languages to improve the quality of search results and user experience. We also explore the use of shared morphological features to build multilingual stemming tools. Focusing on six Bantu languages spoken in Southeastern Africa, we first explore how users would interact with search results written in related languages. We conduct a user study, examining the usefulness of, and user preferences for ranking, search results with different levels of intelligibility, and the types of emotions users experience when interacting with such results. Our results show that users can complete tasks using related-language search results but, as intelligibility decreases, more users struggle to complete search tasks and, consequently, experience negative emotions. Concerning ranking, we find that users prefer that relevant documents be ranked higher, and that intelligibility be used as a secondary criterion. Additionally, we use a User-Centered Design (UCD) approach to investigate enhanced interface features that could help users interact effectively with such search results. The interface we designed scored 86% in a usability evaluation using the System Usability Scale (SUS).
We then investigate whether ranking models that integrate relevance and intelligibility features would improve retrieval effectiveness. We develop these features by drawing from traditional Information Retrieval (IR) models and linguistics studies, and employ Learning To Rank (LTR) and unsupervised methods. Our evaluation shows that models that use both relevance and intelligibility feature(s) perform better than models that use relevance features only. Finally, we propose and evaluate morphological processing approaches that include multilingual stemming, using rules derived from common morphological features across the Bantu family of languages. Our evaluation of the proposed stemming approach shows that its performance is competitive on queries that use general terms. Overall, the thesis provides evidence that considering and matching search results written in closely related languages, as well as ranking and presenting them appropriately, improves the quality of retrieval and user experience for speakers of RSLs.
author2 Suleman, Hussein
author_facet Suleman, Hussein
Chavula, Catherine
author Chavula, Catherine
author_sort Chavula, Catherine
title Using language similarities in retrieval for resource scarce languages: a study of several southern Bantu languages
title_sort using language similarities in retrieval for resource scarce languages: a study of several southern bantu languages
publisher Faculty of Science
publishDate 2021
url http://hdl.handle.net/11427/33614