Improving Taiwanese Pronunciation Scoring via Multiple Acoustic Models
Master's thesis === National Tsing Hua University (國立清華大學) === Institute of Information Systems and Applications (資訊系統與應用研究所) === 99 === This thesis proposes the use of multiple acoustic models in order to improve Taiwanese pronunciation scoring. All pronunciation scoring used in this research is based on ranking of a phone model against its competing models so that the performance evaluation i...
Main Authors: | Chen, Hung-Jui, 陳宏瑞 |
---|---|
Other Authors: | Jang, Jyh-Shing Roger |
Format: | Others |
Language: | zh-TW |
Published: | 2011 |
Online Access: | http://ndltd.ncl.edu.tw/handle/36948546312262458178 |
id |
ndltd-TW-099NTHU5394023 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-099NTHU5394023 2015-10-13T20:23:01Z http://ndltd.ncl.edu.tw/handle/36948546312262458178 Improving Taiwanese Pronunciation Scoring via Multiple Acoustic Models 使用多重聲學模型以改進台語語音評分 Chen, Hung-Jui 陳宏瑞 Master's thesis, National Tsing Hua University (國立清華大學), Institute of Information Systems and Applications (資訊系統與應用研究所), 99 Jang, Jyh-Shing Roger 張智星 2011 thesis 41 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Tsing Hua University (國立清華大學) === Institute of Information Systems and Applications (資訊系統與應用研究所) === 99 === This thesis proposes the use of multiple acoustic models to improve Taiwanese pronunciation scoring. All pronunciation scoring in this research is based on ranking a phone model against its competing models, so performance evaluation is also based on ranking. Five training methods are used to generate acoustic models for different purposes. The first method trains general-purpose acoustic models on the entire training corpus; these models serve as our baseline system. The remaining four training methods each address a different problem encountered in pronunciation scoring: inaccurate forced alignment, variation in Taiwanese accents, unreliable scoring of consonant phonemes, and insufficient training data for certain right-context-dependent biphone models. First, because forced alignment results at the beginning and end of a sentence are usually inaccurate, we discard all sentence-initial and sentence-final phoneme segments from the training data and train our acoustic models on the remainder. Second, regarding accent variation, we found that certain occurrences of the "er" and "o" sounds in our training data were pronounced similarly and were easily confused by our speech recognition system. We address this problem by explicitly increasing the number of mixture components of the corresponding biphone models. Third, consonants are usually short in duration and lack a stable waveform, which makes them harder to model and their scores less reliable. We tackle this problem by increasing the number of mixture components of all models.
Fourth, regarding insufficient training data, we found that certain biphone instances are rarely seen in our training data, while their corresponding monophone instances are abundant. We therefore increase the number of training iterations on these monophone models before extending them into biphone models. The experimental results show that the first three training methods effectively improve scoring performance, while the last method shows a slight decrease in performance. However, we also found that the acoustic models produced by each of the five training methods score satisfactorily on different sets of phone models. We therefore propose a method that uses multiple acoustic models for pronunciation scoring. We look for the best phone model among the five above-mentioned types of acoustic models by running an inside test, and then carry out an outside test for scoring using the corresponding phone models. The experimental results show that the proposed method performs better than any of the five individual models. Keywords: hidden Markov models, acoustic models, multiple acoustic models.
|
author2 |
Jang, Jyh-Shing Roger |
author_facet |
Jang, Jyh-Shing Roger Chen, Hung-Jui 陳宏瑞 |
author |
Chen, Hung-Jui 陳宏瑞 |
author_sort |
Chen, Hung-Jui |
title |
Improving Taiwanese Pronunciation Scoring via Multiple Acoustic Models |
publishDate |
2011 |
url |
http://ndltd.ncl.edu.tw/handle/36948546312262458178 |