Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer

Due to the lack of short vowels or diacritics in Persian orthography, many Natural Language Processing applications for this language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems need to disambiguate the input first, in order to be a...

Bibliographic Details
Main Author:	Nojoumian, Peyman
Language:	en
Published:	2011
Subjects:	Persian; Persian computational linguistics; diacritizer; morphological analyzer; heterophonic homograph; disambiguation
Online Access:	http://hdl.handle.net/10393/20158

id	ndltd-LACETR-oai-collectionscanada.gc.ca-OOU.#10393-20158
record_format	oai_dc
spelling	ndltd-LACETR-oai-collectionscanada.gc.ca-OOU.#10393-20158 | 2014-06-14T03:49:22Z | Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer | Nojoumian, Peyman | Persian | Persian computational linguistics | diacritizer | morphological analyzer | heterophonic homograph | disambiguation | 2011-08-12T20:14:14Z | 2011-08-12T20:14:14Z | 2011 | 2011-08-12 | Thèse / Thesis | http://hdl.handle.net/10393/20158 | en
collection	NDLTD
language	en
sources	NDLTD
topic	Persian; Persian computational linguistics; diacritizer; morphological analyzer; heterophonic homograph; disambiguation
description	Because Persian orthography lacks short vowels or diacritics, many Natural Language Processing applications for the language, including information retrieval, machine translation, text-to-speech, and automatic speech recognition systems, must disambiguate the input before any further processing. In machine translation, for example, the whole text should be correctly diacritized first so that the correct words, parts of speech, and meanings are matched and retrieved from the lexicon. This is primarily because of Persian’s ambiguous orthography. In fact, the core engine of any Persian language processor should utilize a diacritizer and a lexical disambiguator. This dissertation describes the design and implementation of an automatic diacritizer for Persian based on the state-of-the-art Finite State Transducer technology developed at Xerox by Beesley & Karttunen (2003). The results of morphological analysis and generation on a test corpus are shown, including the insertion of diacritics. The study also examines issues raised by the phonological and semantic ambiguities that result from the absence of short vowels in the Persian writing system. It suggests a hybrid model (rule-based and inductive), inspired by psycholinguistic experiments on the human mental lexicon, for disambiguating heterophonic homographs in Persian using frequency and collocation information. A syntactic parser could be developed based on the proposed model to discover Ezafe (the linking short vowel /e/ within a noun phrase) or to disambiguate homographs, but its implementation is left for future work.
author	Nojoumian, Peyman
title	Towards the Development of an Automatic Diacritizer for the Persian Orthography based on the Xerox Finite State Transducer
publishDate	2011
url	http://hdl.handle.net/10393/20158

Similar Items