Cross-Lingual Word Sense Disambiguation for Low-Resource Hybrid Machine Translation

This thesis argues that cross-lingual word sense disambiguation (CL-WSD) can be used to improve lexical selection for machine translation when translating from a resource-rich language into an under-resourced one, especially when relatively little bitext is available. In CL-WSD, we perform...

Bibliographic Details
Main Author: Rudnick, Alexander James
Language: EN
Published: Indiana University 2019
Subjects: Linguistics|Artificial intelligence
Online Access: http://pqdtopen.proquest.com/#viewpdf?dispub=13422906
id ndltd-PROQUEST-oai-pqdtoai.proquest.com-13422906
record_format oai_dc
spelling ndltd-PROQUEST-oai-pqdtoai.proquest.com-13422906 2019-01-10T16:13:36Z Cross-Lingual Word Sense Disambiguation for Low-Resource Hybrid Machine Translation Rudnick, Alexander James Linguistics|Artificial intelligence This thesis argues that cross-lingual word sense disambiguation (CL-WSD) can be used to improve lexical selection for machine translation when translating from a resource-rich language into an under-resourced one, especially when relatively little bitext is available. In CL-WSD, we perform word sense disambiguation, considering the senses of a word to be its possible translations into some target language, rather than using a sense inventory developed manually by lexicographers. Using explicitly trained classifiers that make use of source-language context and of resources for the source language can help machine translation systems make better decisions when selecting target-language words. This is especially the case when the alternative is hand-written lexical selection rules developed by researchers with linguistic knowledge of the source and target languages, but also true when lexical selection would be performed by a statistical machine translation system, when there is a relatively small amount of available target-language text for training language models. In this work, I present the Chipa system for CL-WSD and apply it to the task of translating from Spanish to Guarani and Quechua, two indigenous languages of South America. I demonstrate several extensions to the basic Chipa system, including techniques that allow us to benefit from the wealth of available unannotated Spanish text and existing text analysis tools for Spanish, as well as approaches for learning from bitext resources that pair Spanish with languages unrelated to our intended target languages. Finally, I provide proof-of-concept integrations of Chipa with existing machine translation systems, of two completely different architectures. Indiana University 2019-01-08 00:00:00.0 thesis http://pqdtopen.proquest.com/#viewpdf?dispub=13422906 EN
collection NDLTD
language EN
sources NDLTD
topic Linguistics|Artificial intelligence
description This thesis argues that cross-lingual word sense disambiguation (CL-WSD) can be used to improve lexical selection for machine translation when translating from a resource-rich language into an under-resourced one, especially when relatively little bitext is available. In CL-WSD, we perform word sense disambiguation, considering the senses of a word to be its possible translations into some target language, rather than using a sense inventory developed manually by lexicographers. Using explicitly trained classifiers that make use of source-language context and of resources for the source language can help machine translation systems make better decisions when selecting target-language words. This is especially the case when the alternative is hand-written lexical selection rules developed by researchers with linguistic knowledge of the source and target languages, but also true when lexical selection would be performed by a statistical machine translation system, when there is a relatively small amount of available target-language text for training language models. In this work, I present the Chipa system for CL-WSD and apply it to the task of translating from Spanish to Guarani and Quechua, two indigenous languages of South America. I demonstrate several extensions to the basic Chipa system, including techniques that allow us to benefit from the wealth of available unannotated Spanish text and existing text analysis tools for Spanish, as well as approaches for learning from bitext resources that pair Spanish with languages unrelated to our intended target languages. Finally, I provide proof-of-concept integrations of Chipa with existing machine translation systems, of two completely different architectures.
author Rudnick, Alexander James
title Cross-Lingual Word Sense Disambiguation for Low-Resource Hybrid Machine Translation
publisher Indiana University
publishDate 2019
url http://pqdtopen.proquest.com/#viewpdf?dispub=13422906