Statistical pattern recognition approaches for retrieval-based machine translation systems

This dissertation addresses the problem of Machine Translation (MT), which is defined as an automated translation of a document written in one language (the source language) to another (the target language) by a computer. The MT task requires various types of knowledge of both the source and target...


Bibliographic Details
Main Author:	Mansjur, Dwi Sianto
Published:	Georgia Institute of Technology 2012
Subjects:	Machine translation; Text categorization; Information retrieval; Machine learning; Pattern recognition; Artificial intelligence; Pattern perception; Pattern recognition systems; Machine translating; Algorithms
Online Access:	http://hdl.handle.net/1853/42821

id	ndltd-GATECH-oai-smartech.gatech.edu-1853-42821
record_format	oai_dc
spelling	ndltd-GATECH-oai-smartech.gatech.edu-1853-42821 2013-01-07T20:38:23Z 2012-02-17T19:21:52Z 2012-02-17T19:21:52Z 2011-11-01 Dissertation http://hdl.handle.net/1853/42821
collection	NDLTD
sources	NDLTD
topic	Machine translation; Text categorization; Information retrieval; Machine learning; Pattern recognition; Artificial intelligence; Pattern perception; Pattern recognition systems; Machine translating; Algorithms
description	This dissertation addresses the problem of Machine Translation (MT), defined as the automated translation by a computer of a document written in one language (the source language) into another (the target language). The MT task requires various types of knowledge of both the source and target languages, e.g., linguistic rules and linguistic exceptions. Traditional MT systems rely on an extensive parsing strategy to decode the linguistic rules and use a knowledge base to encode the linguistic exceptions. However, constructing the knowledge base becomes increasingly difficult as the translation system grows. To overcome this difficulty, real translation examples are used instead of a manually crafted knowledge base. This design strategy is known as the Example-Based Machine Translation (EBMT) principle. Traditional EBMT systems use a database of word or phrase translation pairs. The main challenge of this approach is combining the word or phrase translation units into a meaningful and fluent target text. This study proposes a novel Retrieval-Based Machine Translation (RBMT) system that uses a sentence-level translation unit. An advantage of the sentence-level translation unit is that the boundary of a sentence is explicitly defined and its semantics, or meaning, is precise in both the source and target languages. The main challenge of using a sentential translation unit is limited coverage, i.e., the difficulty of finding an exact match between a user query and the sentences in the source database. Using an electronic dictionary and a topic modeling procedure, we develop a procedure to obtain clusters of sensible variations for each example in the source database. The coverage of our MT system improves because an input query is matched against a cluster of sensible variations of each translation example rather than against the original source example alone.
In addition, pattern recognition techniques are used to improve the matching procedure, specifically through the design of optimal pattern classifiers and the incorporation of subjective judgments. A high-performance statistical pattern classifier is used to identify the target sentences for an input query sentence in our MT system. The proposed classifier differs from conventional classifiers in how it addresses generalization. A conventional classifier addresses generalization through the parsimony principle and may therefore choose an oversimplified statistical model. The proposed classifier addresses generalization directly in terms of the training (empirical) data. It is expected to generalize better than conventional classifiers because it is less likely to adopt an oversimplified statistical model given the available training data. We further improve the matching procedure by incorporating subjective judgments. We formulate a novel cost function that combines subjective judgments with the degree of matching between translation examples and an input query, and we provide an optimization strategy for this cost function so that the statistical model can be optimized according to the subjective judgments.
author	Mansjur, Dwi Sianto
title	Statistical pattern recognition approaches for retrieval-based machine translation systems
publisher	Georgia Institute of Technology
publishDate	2012
url	http://hdl.handle.net/1853/42821
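
The retrieval idea summarized in the description (matching a query against a cluster of variations of each stored source sentence, then returning the stored target sentence) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` class, the `translate` function, the token-overlap (Jaccard) similarity, and the toy English-Spanish data are hypothetical and are not taken from the dissertation, which uses a statistical pattern classifier and a cost function that this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One translation example (hypothetical structure, not from the dissertation)."""
    variations: list[str]  # cluster of source-side variations of one sentence
    target: str            # stored target-language sentence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for the proposed classifier."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_score(query: str, example: Example) -> float:
    """Score a query against a whole cluster by its best-matching variation."""
    return max(jaccard(query, v) for v in example.variations)


def translate(query: str, examples: list[Example]) -> tuple[str, float]:
    """Return the target sentence of the best-matching cluster and its score."""
    best = max(examples, key=lambda ex: cluster_score(query, ex))
    return best.target, cluster_score(query, best)


# Toy data invented for this sketch.
examples = [
    Example(["where is the station", "where is the train station"],
            "¿Dónde está la estación?"),
    Example(["how much is this", "what does this cost"],
            "¿Cuánto cuesta esto?"),
]

target, score = translate("where is the bus station", examples)
print(target, round(score, 2))
```

Scoring against the best variation in a cluster, rather than against a single stored sentence, is what lets a query that differs from the original example still retrieve its translation; the similarity measure itself is interchangeable.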


Similar Items