Parsing under-resourced languages: Cross-lingual transfer strategies for Indian languages

Key to fast adaptation of language technologies for any language hinges on the availability of fundamental tools and resources such as monolingual/parallel corpora, annotated corpora, part-of-speech (POS) taggers, parsers and so on. The languages which lack those fundamental resources are often refe...

Full description

Bibliographic Details
Main Author: Ramasamy, Loganathan
Other Authors: Žabokrtský, Zdeněk
Format: Doctoral Thesis
Language:English
Published: 2014
Online Access:http://www.nusl.cz/ntk/nusl-338070
Description
Summary:Key to fast adaptation of language technologies for any language hinges on the availability of fundamental tools and resources such as monolingual/parallel corpora, annotated corpora, part-of-speech (POS) taggers, parsers and so on. The languages which lack those fundamental resources are often referred as under-resourced languages. In this thesis, we address the problem of cross-lingual dependency parsing of under-resourced languages. We apply three methodologies to induce dependency structures: (i) projecting dependencies from a resource-rich language to under-resourced languages via parallel corpus word alignment links (ii) parsing under-resourced languages using parsers whose models are trained on treebanks of other languages, and do not look at actual word forms, but only on POS categories. Here we address the problem of incompatibilities in annotation styles between source side parsers and target side evaluation treebanks by harmonizing annotations to a common standard; and finally (iii) we add a new under-resourced scenario in which we use machine translated parallel corpora instead of human translated corpora for projecting dependencies to under-resourced languages. We apply the aforementioned methodologies to five Indian languages (ILs): Hindi, Urdu, Telugu, Bengali and Tamil (in the order of high...