Improving Taiwanese Pronunciation Scoring via Multiple Acoustic Models

Master's === National Tsing Hua University === Institute of Information Systems and Applications === 99 === This thesis proposes the use of multiple acoustic models in order to improve Taiwanese pronunciation scoring. All pronunciation scoring used in this research is based on ranking of a phone model against its competing models so that the performance evaluation i...


Bibliographic Details
Main Authors: Chen, Hung-Jui, 陳宏瑞
Other Authors: Jang, Jyh-Shing Roger
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/36948546312262458178
id ndltd-TW-099NTHU5394023
record_format oai_dc
spelling ndltd-TW-099NTHU5394023 2015-10-13T20:23:01Z http://ndltd.ncl.edu.tw/handle/36948546312262458178 Improving Taiwanese Pronunciation Scoring via Multiple Acoustic Models 使用多重聲學模型以改進台語語音評分 Chen, Hung-Jui 陳宏瑞 Master's National Tsing Hua University Institute of Information Systems and Applications 99 Jang, Jyh-Shing Roger 張智星 2011 thesis 41 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Tsing Hua University === Institute of Information Systems and Applications === 99 === This thesis proposes the use of multiple acoustic models to improve Taiwanese pronunciation scoring. All pronunciation scoring in this research is based on the rank of a phone model against its competing models, so performance is also evaluated by rank. Five training methods are used to generate acoustic models for different purposes. The first method trains general-purpose acoustic models on the entire training corpus; these models serve as our baseline system. The other four training methods each target a problem encountered during pronunciation scoring: inaccurate forced alignment, variations in Taiwanese accents, unreliable pronunciation scoring of consonant phonemes, and insufficient training data for certain right-context-dependent biphone models. First, since forced alignment results at the beginning and the end of a sentence are usually inaccurate, we discard all sentence-initial and sentence-final phoneme segments from the training data and train our acoustic models on the remainder. Second, for variations in Taiwanese accents, we found that certain occurrences of the "er" and "o" sounds in our training data were pronounced similarly and are easily confused by our speech recognition system. We address this problem by explicitly increasing the number of mixture components of the corresponding biphone models. Third, for unreliable pronunciation scoring of consonant phonemes, consonants are usually short in duration and lack a stable waveform, which makes them more difficult to model. We tackle this problem by increasing the number of mixture components of all models.
Fourth, for insufficient training data for certain right-context-dependent biphone models, we found that certain biphone instances are rarely seen in our training data, while their corresponding monophone instances are abundant. We therefore increase the number of training iterations on these monophone models before extending them into biphone models. The experimental results show that the first three training methods effectively improve the scoring performance, while the last method causes a slight decrease in performance. However, we also found that the acoustic models trained by each of the five methods achieve satisfactory scoring performance on different sets of phone models. We therefore propose a method that uses multiple acoustic models for pronunciation scoring. We look for the best phone model among the five above-mentioned types of acoustic models by running an inside test, and then carry out an outside test for scoring using the corresponding phone models. The experimental results show that the proposed method performs better than any of the five individual models. Key words: hidden Markov models, acoustic models, multiple acoustic models.
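The rank-based scoring and the per-phone model selection described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the dictionary layout, and the use of mean inside-test rank as the selection criterion are all assumptions made for illustration, and the per-phone log-likelihoods are taken as given rather than computed from hidden Markov models.

```python
from typing import Dict, List


def rank_of_target(log_likelihoods: Dict[str, float], target: str) -> int:
    """Return the 1-based rank of the target phone among all phone models.

    Rank 1 means no competing phone model scored higher than the target.
    """
    target_score = log_likelihoods[target]
    # Count competitors whose score is strictly higher than the target's.
    better = sum(
        1
        for phone, score in log_likelihoods.items()
        if phone != target and score > target_score
    )
    return better + 1


def select_best_model_per_phone(
    inside_ranks: Dict[str, Dict[str, List[int]]]
) -> Dict[str, str]:
    """Pick, for each phone, the acoustic model set with the lowest mean rank.

    inside_ranks[model_set][phone] holds the ranks observed for that phone
    in an inside test. Mean rank is an assumed criterion, not one stated
    in the abstract. Ties are broken by model set name.
    """
    phones = set()
    for per_phone in inside_ranks.values():
        phones.update(per_phone)

    best: Dict[str, str] = {}
    for phone in sorted(phones):
        mean_rank = {
            name: sum(per_phone[phone]) / len(per_phone[phone])
            for name, per_phone in inside_ranks.items()
            if per_phone.get(phone)  # skip model sets with no ranks for this phone
        }
        if mean_rank:
            best[phone] = min(sorted(mean_rank), key=mean_rank.get)
    return best
```

In an outside test, each phone segment would then be scored with the model set chosen for that phone, for example `best["o"]`, rather than with a single model set for all phones.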
author2 Jang, Jyh-Shing Roger
author_facet Jang, Jyh-Shing Roger
Chen, Hung-Jui
陳宏瑞
author Chen, Hung-Jui
陳宏瑞
author_sort Chen, Hung-Jui
title Improving Taiwanese Pronunciation Scoring via Multiple Acoustic Models
publishDate 2011
url http://ndltd.ncl.edu.tw/handle/36948546312262458178