Summary: | Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 103 === In this thesis, we adopt a new HMM (hidden Markov model) structure, the half (half context-dependent and size) HMM, which noticeably improves the fluency of synthetic speech when only a limited number of training sentences is available. In addition, we study a method that combines minimum generation error (MGE) based HMM training with formant enhancement or global variance matching to alleviate spectral over-smoothing and thereby improve the signal quality of synthetic speech. To implement MGE based HMM training, we program two different procedures, called the formula-simplification procedure and the dimension-independence procedure. Measurements of generation error show that the dimension-independence procedure is the better one. In practice, MGE based HMM training has three implementation factors that need to be considered, so we compare different combinations of these factors in terms of objective measures (average MFCC distance and variance ratio). We find that keeping the covariance matrix unchanged and using an initial HMM trained with the segmental K-means method is the better choice. In terms of average MFCC distance, the ensemble-training flow is better than the incremental-training flow studied here; in terms of variance ratio, however, the incremental-training flow is better. As to formant enhancement, a comparison of the spectral envelopes obtained with different methods shows that the geometric-series method proposed here is better than the constant-series method. As to global variance matching, an appropriate weight value must be set to prevent abrupt amplitude changes and clicks.
According to the results of listening tests, among the speech synthesis methods using the MGE-trained HMM, the HMM trained with the incremental-training flow is better than the one trained with the ensemble-training flow. The results also show that global variance matching and formant enhancement can generally improve the signal quality of the synthetic speech. Nevertheless, clicks or harsh noises may sometimes be heard in the synthesized speech, which lowers the MOS scores.
|
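The two objective measures named in the summary (average MFCC distance and variance ratio) can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so the ones below are assumptions: the distance is taken as the mean per-frame Euclidean distance between time-aligned natural and synthetic MFCC vectors, and the variance ratio as the per-dimension variance of the synthetic frames divided by that of the natural frames, averaged over dimensions. The function names are hypothetical and are not taken from the thesis.

```python
import math
from statistics import pvariance


def average_mfcc_distance(natural, synthetic):
    """Mean Euclidean distance between time-aligned MFCC frames.

    Assumption: both inputs are lists of equal-length frames that are
    already aligned frame by frame.
    """
    if len(natural) != len(synthetic):
        raise ValueError("frame sequences must have the same length")
    distances = [math.dist(n, s) for n, s in zip(natural, synthetic)]
    return sum(distances) / len(distances)


def variance_ratio(natural, synthetic):
    """Average over dimensions of var(synthetic) / var(natural).

    Assumption: a value below 1 is read as a sign of over-smoothing,
    since the synthetic trajectory varies less than the natural one.
    """
    natural_dims = list(zip(*natural))      # one tuple per MFCC dimension
    synthetic_dims = list(zip(*synthetic))
    ratios = [
        pvariance(syn) / pvariance(nat)
        for nat, syn in zip(natural_dims, synthetic_dims)
    ]
    return sum(ratios) / len(ratios)
```

Under these assumed definitions, a lower average MFCC distance means the generated trajectory is closer to the natural one, while a variance ratio nearer to 1 means less over-smoothing. The two measures can disagree, which is consistent with the summary's finding that the ensemble-training flow is better on distance and the incremental-training flow is better on variance ratio.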