Statistical pattern recognition approaches for retrieval-based machine translation systems
This dissertation addresses the problem of Machine Translation (MT), which is defined as an automated translation of a document written in one language (the source language) to another (the target language) by a computer. The MT task requires various types of knowledge of both the source and target...
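Main Author: | Mansjur, Dwi Sianto |
---|---|
Published: | Georgia Institute of Technology, 2012 |
Subjects: | Machine translation; Text categorization; Information retrieval; Machine learning; Pattern recognition; Artificial intelligence; Pattern perception; Pattern recognition systems; Machine translating; Algorithms |
Online Access: | http://hdl.handle.net/1853/42821 |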
id |
ndltd-GATECH-oai-smartech.gatech.edu-1853-42821 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-GATECH-oai-smartech.gatech.edu-1853-42821; 2013-01-07T20:38:23Z; Statistical pattern recognition approaches for retrieval-based machine translation systems; Mansjur, Dwi Sianto; Machine translation; Text categorization; Information retrieval; Machine learning; Pattern recognition; Artificial intelligence; Pattern perception; Pattern recognition systems; Machine translating; Algorithms; Georgia Institute of Technology; 2012-02-17T19:21:52Z; 2012-02-17T19:21:52Z; 2011-11-01; Dissertation; http://hdl.handle.net/1853/42821 |
collection |
NDLTD |
sources |
NDLTD |
topic |
Machine translation; Text categorization; Information retrieval; Machine learning; Pattern recognition; Artificial intelligence; Pattern perception; Pattern recognition systems; Machine translating; Algorithms |
description |
This dissertation addresses the problem of Machine Translation (MT), defined as the automated translation by a computer of a document written in one language (the source language) into another (the target language). The MT task requires various types of knowledge of both the source and target languages, e.g., linguistic rules and linguistic exceptions. Traditional MT systems rely on an extensive parsing strategy to decode the linguistic rules and use a knowledge base to encode the linguistic exceptions. However, constructing the knowledge base becomes a problem as the translation system grows. To overcome this difficulty, real translation examples are used instead of a manually crafted knowledge base. This design strategy is known as the Example-Based Machine Translation (EBMT) principle. Traditional EBMT systems use a database of word or phrase translation pairs. The main challenge of this approach is combining the word or phrase translation units into a meaningful and fluent target text. This study proposes a novel Retrieval-Based Machine Translation (RBMT) system that uses a sentence-level translation unit. An advantage of the sentence-level translation unit is that the boundary of a sentence is explicitly defined and its semantics, or meaning, is precise in both the source and target languages. The main challenge of using a sentential translation unit is limited coverage, i.e., the difficulty of finding an exact match between a user query and the sentences in the source database. Using an electronic dictionary and a topic modeling procedure, we develop a method to obtain clusters of sensible variations for each example in the source database. The coverage of our MT system improves because an input query is matched against a cluster of sensible variations of a translation example rather than against the original source example alone. In addition, pattern recognition techniques are used to improve the matching procedure, specifically through the design of optimal pattern classifiers and the incorporation of subjective judgments. A high-performance statistical pattern classifier is used to identify the target sentences for an input query sentence. The proposed classifier differs from a conventional classifier in the way it addresses generalization. A conventional classifier addresses generalization using the parsimony principle and risks choosing an oversimplified statistical model. The proposed classifier addresses generalization directly in terms of the training (empirical) data. It is expected to generalize better than conventional classifiers because it is less likely to adopt an oversimplified statistical model given the available training data. We further improve the matching procedure by incorporating subjective judgments. We formulate a novel cost function that combines subjective judgments with the degree of matching between translation examples and an input query, and we provide an optimization strategy for this cost function so that the statistical model can be optimized according to the subjective judgments. |
author |
Mansjur, Dwi Sianto |
title |
Statistical pattern recognition approaches for retrieval-based machine translation systems |
publisher |
Georgia Institute of Technology |
publishDate |
2012 |
url |
http://hdl.handle.net/1853/42821 |
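
The retrieval idea summarized in the description (matching a query against a cluster of variations of each sentence-level example, then returning the stored target sentence) can be illustrated with a minimal sketch. This is not the dissertation's method: a simple token-overlap (Jaccard) score stands in for the statistical pattern classifier, the hand-written variation clusters stand in for those derived from an electronic dictionary and a topic model, and every name (`Example`, `jaccard`, `retrieve_translation`) and sample sentence is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """A sentence-level translation unit (hypothetical structure)."""
    variations: tuple[str, ...]  # source sentence plus assumed paraphrases
    target: str                  # stored target-language sentence


def tokens(sentence: str) -> set[str]:
    """Lowercase whitespace tokenization; a real system would do far more."""
    return set(sentence.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a stand-in for a learned matching score."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_translation(query: str, examples: list[Example]) -> tuple[str, float]:
    """Return the target of the example whose best variation matches the query.

    Each example is scored by its best-matching variation, so a query can
    hit an example even when it differs from the original source sentence.
    """
    best_target, best_score = "", 0.0
    for example in examples:
        score = max((jaccard(query, v) for v in example.variations), default=0.0)
        if score > best_score:
            best_target, best_score = example.target, score
    return best_target, best_score


# Hypothetical sample data, for illustration only.
EXAMPLES = [
    Example(("where is the station", "where is the train station"),
            "Où est la gare ?"),
    Example(("how much does this cost", "what is the price of this"),
            "Combien ça coûte ?"),
]

if __name__ == "__main__":
    print(retrieve_translation("where is the train station please", EXAMPLES))
```

In this toy data the query shares four of six tokens with the original source sentence but five of six with its variation, which is the coverage effect the description attributes to matching against clusters rather than single examples.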