Summary: | Master's === National Chi Nan University === Department of Computer Science and Information Engineering === 101 === The phrase translation table is the core model component of state-of-the-art phrase-based statistical machine translation (SMT) systems. Most phrases are induced from word alignment results by heuristics that find phrase pairs "consistent" with the word alignment. The phrase translation table is thus affected both by word alignment accuracy and by the heuristics used to find consistent phrase pairs. Without an objective optimization criterion for phrase segmentation, however, a large number of consistent yet noisy phrase pairs may be generated. Furthermore, the phrases are essentially defined in terms of two languages jointly, so they might not respect the phrase structure of either language individually. As a result, overly specific phrase pairs and phrases may be induced.
Such a huge and noisy phrase translation table is likely to introduce estimation errors during training, when the phrase translation probabilities are estimated, as well as search errors during decoding. The large search space might also slow down the decoding process. To improve the performance of current phrase-based SMT, it is thus necessary to optimize the phrase segmentation and phrase alignment models by jointly considering the word alignment results and a non-heuristic model for phrase segmentation. Doing so might significantly improve the quality and speed of decoding, and thus the translation fluency.
In particular, an EM algorithm is proposed to segment the source and target language corpora into phrases independently of each other. A phrase alignment algorithm is then applied to these well-segmented phrases, with good estimates of the phrase translation probabilities based on the word alignment statistics. It thus becomes possible to use the word alignment and phrase segmentation results jointly and quantitatively, rather than heuristically, to produce a high-quality phrase translation table and its translation probabilities.
|
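The monolingual EM phrase segmentation mentioned in the summary can be illustrated with a minimal sketch. The summary does not specify the segmentation model, so everything below is an assumption made for illustration: a unigram phrase model in which the probability of a segmentation is the product of its phrase probabilities, a hypothetical maximum phrase length `max_len`, and the hypothetical function names `em_segment` and `viterbi_segment`. Expected phrase counts are collected with a forward-backward pass over all segmentations of each sentence and then renormalized.

```python
import math
from collections import defaultdict


def candidate_phrases(sentence, max_len):
    """Yield (start, end, phrase) for every span of at most max_len words."""
    n = len(sentence)
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            yield i, j, tuple(sentence[i:j])


def em_segment(corpus, max_len=3, iterations=10):
    """Estimate phrase probabilities with EM (assumed unigram phrase model)."""
    # Uniform initialization over every candidate phrase type in the corpus.
    types = {ph for s in corpus for _, _, ph in candidate_phrases(s, max_len)}
    probs = {ph: 1.0 / len(types) for ph in types}

    for _ in range(iterations):
        counts = defaultdict(float)
        for sent in corpus:
            n = len(sent)
            if n == 0:
                continue
            # forward[j]: total probability of all segmentations of sent[:j]
            forward = [0.0] * (n + 1)
            forward[0] = 1.0
            for j in range(1, n + 1):
                for i in range(max(0, j - max_len), j):
                    forward[j] += forward[i] * probs[tuple(sent[i:j])]
            # backward[i]: total probability of all segmentations of sent[i:]
            backward = [0.0] * (n + 1)
            backward[n] = 1.0
            for i in range(n - 1, -1, -1):
                for j in range(i + 1, min(n, i + max_len) + 1):
                    backward[i] += probs[tuple(sent[i:j])] * backward[j]
            # E-step: posterior probability that each span is used as a phrase.
            z = forward[n]
            for i, j, ph in candidate_phrases(sent, max_len):
                counts[ph] += forward[i] * probs[ph] * backward[j] / z
        # M-step: renormalize expected counts into probabilities.
        total = sum(counts.values())
        probs = {ph: c / total for ph, c in counts.items()}
    return probs


def viterbi_segment(sentence, probs, max_len=3, floor=1e-12):
    """Return the most probable segmentation under the estimated model."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            # Unseen or zero-probability phrases get a small floor value.
            p = max(probs.get(tuple(sentence[i:j]), 0.0), floor)
            score = best[i] + math.log(p)
            if score > best[j]:
                best[j], back[j] = score, i
    phrases, j = [], n
    while j > 0:
        phrases.append(tuple(sentence[back[j]:j]))
        j = back[j]
    return phrases[::-1]
```

Under these assumptions, the source and target corpora would each be passed to `em_segment` separately, and `viterbi_segment` would then produce the phrase sequences handed to a phrase alignment step. The sketch works with raw probabilities, so very long sentences may underflow, and a plain maximum-likelihood unigram model tends to favor longer phrases; a practical system would likely need log-space arithmetic and some form of regularization.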