Summary: | Master === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 100 === Speech technology now helps people in many aspects of daily life, and speech synthesis has recently become one of its important components. Two speech synthesis techniques are commonly used: unit selection and HMM-based synthesis. In the unit selection technique, the recorded voice in the corpus is segmented into small units, which are then concatenated to generate the synthesized speech. In the HMM-based technique, acoustic models are trained from acoustic features, and synthesized speech is generated from those models.
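To illustrate the unit selection idea mentioned above, the following is a minimal sketch (not from the thesis) of choosing one candidate unit per target position so that the sum of target cost and join (concatenation) cost is minimized by dynamic programming; the cost functions and candidate features are hypothetical placeholders.

```python
import numpy as np

def select_units(target_feats, candidates):
    """target_feats: list of T target feature vectors.
    candidates: list of T lists of candidate unit feature vectors.
    Returns the index of the chosen candidate at each position."""
    T = len(target_feats)
    # best[t][i]: minimal total cost of a path ending with candidate i at position t
    best = [np.array([np.linalg.norm(c - target_feats[0]) for c in candidates[0]])]
    back = []
    for t in range(1, T):
        # target cost: distance between each candidate and the target specification
        tgt = np.array([np.linalg.norm(c - target_feats[t]) for c in candidates[t]])
        prev = best[-1]
        costs = np.empty(len(candidates[t]))
        ptrs = np.empty(len(candidates[t]), dtype=int)
        for i, c in enumerate(candidates[t]):
            # join cost: distance between this candidate and each previous candidate
            join = np.array([np.linalg.norm(c - p) for p in candidates[t - 1]])
            total = prev + join
            ptrs[i] = int(np.argmin(total))
            costs[i] = total[ptrs[i]] + tgt[i]
        best.append(costs)
        back.append(ptrs)
    # trace back the lowest-cost sequence of unit indices
    path = [int(np.argmin(best[-1]))]
    for ptrs in reversed(back):
        path.append(int(ptrs[path[-1]]))
    return list(reversed(path))
```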
In this thesis, I used the HMM-based technique to implement a Chinese text-to-speech (TTS) system. The system extracts spectral features, fundamental frequency (F0) features, and context-dependent labels to train the acoustic models. After the training stage, it analyzes the input text and uses the corresponding models to generate speech.
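As a rough illustration of the two acoustic streams described above, here is a minimal sketch assuming the librosa package, with MFCCs standing in for the spectral features and pYIN for F0 estimation; the thesis's actual front end and feature set may differ.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """Return a spectral feature matrix and an F0 contour for one utterance."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Spectral stream: MFCCs as a stand-in for the spectral features used in training.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=25)           # shape: (25, frames)
    # Excitation stream: frame-level F0 estimated with pYIN.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = np.where(voiced_flag, f0, 0.0)                          # 0 marks unvoiced frames
    return mfcc.T, f0
```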
Training a high-quality acoustic model requires a large amount of training data. Because it is difficult to collect enough data from the target speaker, the conventional approach combines an average acoustic model with speaker adaptation so that training with less data becomes possible. However, it is difficult to bring an average acoustic model close to the target speaker's own model, so the performance of speaker adaptation is limited. In this thesis, I proposed several methods to find speakers who are acoustically similar to the target speaker, treat them as the target speaker's support speakers, and use their training data to train support speaker models.
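The thesis proposes several selection methods; as one simple, hypothetical example of the general idea, the sketch below represents each speaker by the mean of their acoustic feature vectors and picks the k speakers closest to the target speaker in Euclidean distance. The actual similarity measures used in the thesis may differ.

```python
import numpy as np

def select_support_speakers(target_feats, speaker_feats, k=3):
    """target_feats: (N, D) array of the target speaker's feature frames.
    speaker_feats: dict mapping speaker id -> (Ni, D) array of feature frames.
    Returns the ids of the k acoustically closest candidate speakers."""
    target_mean = target_feats.mean(axis=0)
    # Distance between each candidate speaker's mean feature vector and the target's.
    dists = {spk: np.linalg.norm(feats.mean(axis=0) - target_mean)
             for spk, feats in speaker_feats.items()}
    return sorted(dists, key=dists.get)[:k]
```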
I conducted both objective and subjective experiments. The results show that the support speaker model technique outperforms the average acoustic model technique and yields better synthesis quality.
|