Improving Translation Fluency with a Monolingual Statistical Machine Translation Model
Master's thesis === National Chi Nan University === Department of Computer Science and Information Engineering === 96 === Although there has been a great deal of SMT research over the past decades, performance is still far from satisfactory. In translating English into Chinese, for instance, BLEU scores (Papineni, 2002) range only between 0.21 and 0.29. Such translation quality is very disfl...
Main Authors: | Sheng-Sian Lin 林勝賢 |
---|---|
Other Authors: | Jing-Shin Chang |
Format: | Others |
Language: | zh-TW |
Published: | 2008 |
Online Access: | http://ndltd.ncl.edu.tw/handle/44433665590486418899 |
id | ndltd-TW-096NCNU0392001 |
record_format | oai_dc |
collection | NDLTD |
language | zh-TW |
format | Others |
sources | NDLTD |
description |
Master's thesis === National Chi Nan University === Department of Computer Science and Information Engineering === 96 === Although there has been a great deal of SMT research over the past decades, performance is still far from satisfactory. In translating English into Chinese, for instance, BLEU scores (Papineni, 2002) range only between 0.21 and 0.29. Such translation quality is highly disfluent for human readers. The goal of the current work is to propose a statistical post-editing model for improving the fluency of translated sentences.
Approaches to improving classical SMT have long concentrated on the Translation Model (TM). Unfortunately, classical SMT models have very low expressive power: word-for-word or phrase-to-phrase translation with a small amount of local re-ordering may not generate fluent target-language sentences. In particular, many target-specific lexical items and morphemes cannot be generated by such models. The implication is that we may have to go beyond the limitations of classical SMT models in order to improve translation fluency. In particular, the Language Model (LM) and the search (or decoding) process, which have been largely neglected in past research, should play more important roles.
In the current work, we propose to adapt a Statistical Post-Editing (SPE) model to translate disfluent sentences into fluent versions. Such a system can be regarded as a “disfluent-to-fluent” SMT system that can be trained as a monolingual SMT model. It is special in that the training corpus can be easily acquired from a large monolingual corpus of fluent target-language sentences. By automatically generating a disfluent version of the fluent monolingual corpus, one can acquire the model parameters for translating disfluent sentences into fluent ones through a training process similar to that of a standard SMT system. With such a model, the most likely fluent sentence for a translated sentence can be retrieved from an example base. Compared with standard SMT training, which requires a parallel bilingual corpus, a monolingual corpus is much easier to acquire.
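The pseudo-parallel training setup described above can be sketched as follows. The corruption operations here (random word dropping and a single adjacent swap) are illustrative assumptions chosen to show the idea, not the exact noise model used in the thesis:

```python
import random

def corrupt(sentence, seed=None):
    """Generate a synthetic 'disfluent' version of a fluent sentence by
    randomly dropping words and applying a small local re-ordering.
    (Both operations are illustrative; the actual noise model is a
    design choice of the SPE system.)"""
    rng = random.Random(seed)
    words = sentence.split()
    # Randomly drop tokens with small probability (simulates missing
    # function words or morphemes in machine-translated output).
    words = [w for w in words if rng.random() > 0.1 or len(words) <= 2]
    # Apply a little local re-ordering: swap one adjacent pair.
    if len(words) > 2:
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def build_pseudo_parallel(fluent_corpus, seed=0):
    """Pair each fluent sentence with a corrupted version, yielding a
    'disfluent -> fluent' corpus that can be fed to standard SMT
    training machinery."""
    return [(corrupt(s, seed + i), s) for i, s in enumerate(fluent_corpus)]

corpus = ["the model improves fluency of translated sentences"]
pairs = build_pseudo_parallel(corpus)
# Each pair is (synthetic disfluent source, original fluent target).
```

The key point is that only the fluent side needs to exist beforehand; the "source" side of each training pair is manufactured automatically.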
The proposed LM for the current SPE, which is responsible for selecting fluent target segments, is a phrase-based unigram model rather than the word-based trigram model widely used in classical SMT. Since a phrase can cover more than three words, the selected phrases can be more fluent than word trigrams. Furthermore, we have decided not to define target phrases as chunks of bilingually aligned words. Instead, the best target phrases are trained directly from the monolingual target corpus by optimizing the phrase-based unigram model. Such phrases fit the target grammar well and therefore tend to generate more fluent sentences in general. Moreover, the number of such phrases is much smaller than the number of arbitrarily combined phrases derived from word-aligned chunks; as a result, the estimation error is significantly reduced, and the required monolingual training corpus can be much smaller as well. Unlike parallel bilingual training corpora, which are scarce, target-language corpora are available in extremely large quantities, so fluent phrases can be extracted reliably.
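A phrase-based unigram LM scores a sentence as the product of the unigram probabilities of the phrases it is segmented into. As a minimal sketch (the tiny phrase table and the unknown-word floor are illustrative assumptions, not trained values), a Viterbi search finds the segmentation the model prefers:

```python
import math

def best_segmentation(words, phrase_logprob, max_len=5):
    """Viterbi segmentation of a word sequence into phrases, maximizing
    a phrase-based unigram LM score: log P(sentence) = sum of the
    log-probabilities of its phrases. Unknown single words receive a
    floor score so every sentence remains scorable."""
    n = len(words)
    UNK = -20.0  # floor log-prob for unseen one-word phrases (assumption)
    best = [0.0] + [-math.inf] * n   # best[j]: score of words[:j]
    back = [0] * (n + 1)             # back[j]: start of last phrase
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            phrase = " ".join(words[i:j])
            lp = phrase_logprob.get(phrase)
            if lp is None:
                if j - i == 1:
                    lp = UNK
                else:
                    continue  # multi-word phrases must be in the table
            if best[i] + lp > best[j]:
                best[j] = best[i] + lp
                back[j] = i
    # Recover the phrase sequence by following back-pointers.
    phrases, j = [], n
    while j > 0:
        i = back[j]
        phrases.append(" ".join(words[i:j]))
        j = i
    return list(reversed(phrases)), best[n]

table = {"machine translation": -1.0, "machine": -3.0, "translation": -3.0}
segs, score = best_segmentation("machine translation".split(), table)
# The single phrase (-1.0) beats the two-word segmentation (-6.0).
```

This illustrates why a well-estimated phrase inventory helps: a fluent multi-word phrase can win over any composition of shorter units.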
As far as the search or decoding process is concerned, the proposed method searches for the most likely fluent sentence(s) in an example base or on the Web. Local editing is then applied only to a local region of the example sentence, guided by the disfluent input. Intuitively, sentences retrieved from an example base or the Web will be much more fluent than sentences automatically assembled by an SMT decoding module. Even when local editing is required, the repair is quite local: the search space for repairing is significantly constrained by the words of the most likely example sentence. Such a post-editing step can thus be regarded as constrained decoding, and the search error is reduced significantly compared with the large search space of the decoding process of a typical SMT system. Experiments on several error types of the translation process show that the proposed statistical post-editing model did improve fluency significantly.
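The retrieval-plus-local-repair idea can be illustrated with a small sketch. The character-level similarity measure and the `difflib`-based diff below are stand-ins chosen for illustration, not the thesis's actual search and repair procedures:

```python
import difflib

def retrieve_example(disfluent, example_base):
    """Retrieve the most similar fluent sentence from an example base,
    using character-level sequence similarity as a stand-in for the
    system's search criterion (an assumption for illustration)."""
    return max(example_base,
               key=lambda s: difflib.SequenceMatcher(None, disfluent, s).ratio())

def local_edit_regions(disfluent, example):
    """Identify the word spans where the retrieved example differs from
    the disfluent input. Repair is confined to these small regions, so
    decoding is constrained by the words of the example sentence."""
    a, b = disfluent.split(), example.split()
    sm = difflib.SequenceMatcher(None, a, b)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

base = ["he went to the store yesterday", "she reads books every day"]
ex = retrieve_example("he go to store yesterday", base)
edits = local_edit_regions("he go to store yesterday", ex)
# Only the small differing spans need repair; the rest of the retrieved
# example is kept verbatim, which is what constrains the search space.
```

Because most of the output is copied from an already-fluent example, the decoder only has to choose among a handful of local repairs rather than assemble a whole sentence.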
|
author2 | Jing-Shin Chang |
author_facet | Jing-Shin Chang Sheng-Sian Lin 林勝賢 |
author | Sheng-Sian Lin 林勝賢 |
spellingShingle | Sheng-Sian Lin 林勝賢 Improving Translation Fluency with a Monolingual Statistical Machine Translation Model |
author_sort | Sheng-Sian Lin |
title | Improving Translation Fluency with a Monolingual Statistical Machine Translation Model |
title_short | Improving Translation Fluency with a Monolingual Statistical Machine Translation Model |
title_full | Improving Translation Fluency with a Monolingual Statistical Machine Translation Model |
title_fullStr | Improving Translation Fluency with a Monolingual Statistical Machine Translation Model |
title_full_unstemmed | Improving Translation Fluency with a Monolingual Statistical Machine Translation Model |
title_sort | improving translation fluency with a monolingual statistical machine translation model |
publishDate | 2008 |
url | http://ndltd.ncl.edu.tw/handle/44433665590486418899 |
work_keys_str_mv | AT shengsianlin improvingtranslationfluencywithamonolingualstatisticalmachinetranslationmodel AT línshèngxián improvingtranslationfluencywithamonolingualstatisticalmachinetranslationmodel AT shengsianlin gǎishànfānyìliúchàngdùzhīdānyǔtǒngjìshìjīqìfānyìmóshì AT línshèngxián gǎishànfānyìliúchàngdùzhīdānyǔtǒngjìshìjīqìfānyìmóshì |
_version_ | 1718270485927559168 |
spelling | ndltd-TW-096NCNU0392001 2016-05-18T04:12:53Z http://ndltd.ncl.edu.tw/handle/44433665590486418899 Improving Translation Fluency with a Monolingual Statistical Machine Translation Model 改善翻譯流暢度之單語統計式機器翻譯模式 Sheng-Sian Lin 林勝賢 Jing-Shin Chang 張景新 2008 學位論文 ; thesis 60 zh-TW |