Designing a general framework for text alignment : case studies with two South Asian languages

Building machine translation systems for many South Asian languages (such as Hindi, Gujarati, etc.) using statistical methods is problematic. The primary reason is insufficient parallel data to learn accurate word alignment. Additionally, these languages are morphologically rich and have free word o...


Bibliographic Details
Main Author: Aswani, Niraj
Other Authors: Gaizauskas, Robert
Published: University of Sheffield 2012
Subjects: 491.4
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.557559
id ndltd-bl.uk-oai-ethos.bl.uk-557559
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-557559 2017-10-04T03:25:48Z http://etheses.whiterose.ac.uk/2618/ Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 491.4
description Building machine translation systems for many South Asian languages (such as Hindi and Gujarati) using statistical methods is problematic. The primary reason is insufficient parallel data to learn accurate word alignment. Additionally, these languages are morphologically rich and have free word order. When it is difficult to rely purely on statistical methods due to insufficient data, research shows that better performance can be obtained by building hybrid systems that rely on language-specific resources, such as morphological analysers or dictionaries, as well as statistical methods. However, it is difficult to find such language-specific resources for many South Asian languages. Since languages such as Hindi, Gujarati, Urdu, Bengali, Punjabi and Marathi are all very similar in structure, and the main differences lie in the script and vocabulary used, we hypothesise that it is possible to develop resources for one of these languages and generalise the approach to allow rapid bootstrapping of similar resources for the other closely related languages, with minimal effort and similar accuracy. To verify this, we develop a few resources for the Hindi language, including a sentence alignment algorithm, a morphological analyser and a transliteration similarity component, and generalise the approach to allow rapid bootstrapping of similar resources for the Gujarati language. We show that the approach works on both the Hindi and Gujarati languages and achieves results that are comparable to similar state-of-the-art (SOA) resources available for these languages. We also hypothesise that it is possible to develop a high-performance hybrid word alignment algorithm that relies on such language-specific resources. To verify this, we design, implement and evaluate a novel English-Hindi hybrid word alignment system that uses the Hindi-specific resources we developed. We show not only that our word alignment system outperforms other SOA English-Hindi word alignment systems, but also how simple it is to adapt it to the English-Gujarati language pair.
author2 Gaizauskas, Robert
author_facet Gaizauskas, Robert
Aswani, Niraj
author Aswani, Niraj
author_sort Aswani, Niraj
title Designing a general framework for text alignment : case studies with two South Asian languages
publisher University of Sheffield
publishDate 2012
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.557559