Summary: | Ph.D. === National Taiwan University of Science and Technology (國立臺灣科技大學) === Department of Electrical Engineering === 95 === Machine transliteration, or phonetic transcription, plays an important role in natural language processing tasks such as named entity recognition (NER), cross-language information retrieval (CLIR), question answering (QA), and machine translation (MT). It is the process of rendering a word from one language in another language while preserving its original pronunciation, also known as translation-by-sound. A collection of transliterations is important to the study of machine transliteration; however, constructing such a corpus is time-consuming and labor-intensive.
This thesis proposes three learning frameworks for automatically extracting transliterations from the Web. We formulate the machine transliteration process using a phonetic similarity model (PSM), which consists of phonetic confusion matrices and a Chinese character n-gram language model. With the PSM, the extraction of transliteration pairs becomes a two-step process of recognition followed by validation. First, in the recognition step, we identify the most probable transliteration in the k-neighborhood of a spotted English word. Then, in the validation step, we qualify the transliteration pair candidates with a hypothesis test. To help formulate the PSM, we also carry out an analytical study of the statistics of several key factors in English-Chinese transliteration, such as lexical variation and phonetic variation, which give rise to casual transliteration.
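The two-step process above can be illustrated with a minimal sketch. It is not the thesis implementation: the function `extract_pair`, the scoring callables `pair_score` and `null_score`, the token-based reading of the k-neighborhood, and the log-likelihood-ratio form of the hypothesis test are all assumptions made for illustration. In a real system the pair score would come from the PSM (confusion matrices plus the n-gram language model); here it is a toy lookup table.

```python
from typing import Callable, List, Optional, Tuple


def extract_pair(
    tokens: List[str],
    english_index: int,
    k: int,
    pair_score: Callable[[str, str], float],
    null_score: Callable[[str], float],
    threshold: float,
) -> Optional[Tuple[str, str]]:
    """Recognition followed by validation for one spotted English word.

    pair_score and null_score are hypothetical stand-ins that return
    log-scores; the real PSM scoring is not reproduced here.
    """
    english = tokens[english_index]
    # Assumed reading of "k-neighborhood": up to k tokens on either side.
    lo = max(0, english_index - k)
    hi = min(len(tokens), english_index + k + 1)
    candidates = [tokens[i] for i in range(lo, hi) if i != english_index]
    if not candidates:
        return None
    # Recognition: pick the most probable candidate in the neighborhood.
    best = max(candidates, key=lambda c: pair_score(english, c))
    # Validation: accept only if the log-score ratio against the null
    # hypothesis ("not a transliteration") exceeds the threshold.
    ratio = pair_score(english, best) - null_score(best)
    return (english, best) if ratio > threshold else None


# Toy scores, invented for this example only.
TOY_SCORES = {("Kodak", "柯達"): -1.0}


def toy_pair_score(english: str, candidate: str) -> float:
    return TOY_SCORES.get((english, candidate), -10.0)


def toy_null_score(candidate: str) -> float:
    return -5.0


result = extract_pair(
    ["購買", "柯達", "Kodak", "相機"], 2, 1,
    toy_pair_score, toy_null_score, 0.0,
)
print(result)
```

Separating recognition from validation keeps the two decisions independent: the first step only ranks candidates, while the second decides whether even the best-ranked candidate is a plausible transliteration at all.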
Within the learning frameworks, we first present supervised and unsupervised learning to harvest transliterations from a development corpus. The experimental results validate the effectiveness of the PSM, with supervised learning achieving an F-measure of 0.739. Unsupervised learning, bootstrapped with prior ASR (automated speech recognition) knowledge, performs very close to the supervised approach, which allows us to deploy automatic extraction of transliteration pairs in the Web space.
Then, we exploit active learning, which actively selects the most informative samples for annotation instead of passively receiving samples for learning, to improve performance. We find that, in approaching the performance of supervised learning, the most effective active learning strategy achieves an F-measure of 0.722 while reducing the labeling effort by 90.2%. Finally, we employ multi-view learning to reduce the need for human annotation and further improve performance. Two learning strategies, Co-training and Co-EM, are implemented in an unsupervised manner to discover transliterations from the Web. The most effective view setting achieves an F-measure of 0.727. The reported performance demonstrates the effectiveness of the proposed approaches, which make it possible to obtain a set of transliterations from the Web easily and quickly.
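To make the idea of actively selecting informative samples concrete, the sketch below shows uncertainty sampling, one common active learning strategy. The abstract does not say which selection strategies the thesis evaluates, so this is an assumed stand-in rather than the author's method; `select_for_annotation`, the `confidence` callable, and the sample data are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_for_annotation(
    pool: Sequence[T],
    confidence: Callable[[T], float],
    threshold: float,
    batch_size: int,
) -> List[T]:
    """Return the batch_size samples the current model is least sure about.

    Uncertainty is measured as the distance between the model's confidence
    and the accept/reject threshold: the closer, the more informative.
    """
    ranked = sorted(pool, key=lambda s: abs(confidence(s) - threshold))
    return ranked[:batch_size]


# Hypothetical candidate pairs with invented model confidences.
pool = [("a", 0.90), ("b", 0.10), ("c", 0.52), ("d", 0.45)]
batch = select_for_annotation(pool, lambda s: s[1], 0.5, 2)
print(batch)
```

Only the selected batch is sent to a human annotator; the model is then retrained and the selection repeated, which is how this family of methods can cut labeling effort compared with annotating the whole pool.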
|