Summary: Doctoral dissertation === National Cheng Kung University === Department of Electrical Engineering === 104 === Research on speech synthesis often faces two conflicting issues: fast, inexpensive system construction that yields low speech quality from insufficient data, or time-consuming, labor-intensive effort that achieves decent speech quality from a large database. The main goal of this dissertation is to develop a speech synthesis system that generates personalized, natural-sounding speech from a small corpus, striking a compromise between data-preparation effort and speech quality.
First, to meet the demand for a precisely segmented corpus for high-quality speech synthesis, this study proposed a speech segmentation algorithm to automatically segment the speech corpus. In this method, articulatory features are first adopted to find candidate segmentation points. Then, a minimum description length (MDL)-based segmentation algorithm determines the optimal phone boundaries. Finally, the resulting phone boundaries are used to refine the segmentation obtained from Viterbi-based forced alignment, yielding more precise results, especially for spontaneous speech. Experimental results show that the proposed speech segmentation algorithm improves on the Viterbi-based approach.
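The following minimal sketch illustrates only the final refinement step under simplifying assumptions: each Viterbi-aligned phone boundary is snapped to the nearest candidate point surviving the articulatory-feature/MDL stage when one lies within a small tolerance window. The function name, the 30 ms tolerance, and the representation of boundaries as times in seconds are illustrative choices, not details taken from the dissertation.

```python
import numpy as np

def refine_boundaries(viterbi_bounds, candidate_bounds, tol=0.03):
    """Snap each Viterbi-aligned boundary (in seconds) to the nearest
    articulatory-feature/MDL candidate within `tol` seconds; otherwise keep
    the Viterbi value. Hypothetical sketch, not the thesis algorithm."""
    candidates = np.asarray(sorted(candidate_bounds), dtype=float)
    refined = []
    for b in viterbi_bounds:
        if candidates.size == 0:
            refined.append(b)
            continue
        j = int(np.argmin(np.abs(candidates - b)))
        refined.append(float(candidates[j]) if abs(candidates[j] - b) <= tol else b)
    return refined

# Example: the first two boundaries move to nearby candidates, the third stays.
print(refine_boundaries([0.12, 0.45, 0.80], [0.10, 0.47]))  # -> [0.1, 0.47, 0.8]
```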
On the basis of a small corpus, we proposed a hybrid speech synthesis technique comprising candidate expansion, two-level unit selection, and prosodic-word-level prosody adjustment. In this method, candidate expansion retrieves potential units that are unlikely to be retrieved using linguistic features alone. The two-level unit selection mechanism then selects the optimal unit sequence from the expanded candidate units by considering both the phone and prosodic-word levels. Finally, prosodic-word-level prosody adjustment verifies the prosodic parameters of each syllable in a prosodic word against the statistics of the speech corpus and adjusts the prosody of any syllable that fails the verification, using the output of statistical parametric speech synthesis as the reference. Experimental results show that the proposed method generates high-quality, natural synthesized speech from a small corpus.
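As a rough illustration of the phone-level half of the two-level selection, the sketch below runs a standard Viterbi search over expanded candidate lists, minimizing the sum of target and concatenation costs; the prosodic-word level and the prosody verification and adjustment steps are omitted. The function names and the toy cost functions are assumptions for illustration, not the dissertation's actual cost design.

```python
def select_units(candidates, target_cost, concat_cost):
    """Viterbi search over per-position candidate unit lists.
    candidates[t] is the (expanded) candidate list for position t;
    target_cost(t, u) and concat_cost(u, v) are assumed cost callables."""
    T = len(candidates)
    # best[t][i] = (cumulative cost of the best path ending in candidate i
    #               at position t, index of its predecessor at t - 1)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for t in range(1, T):
        row = []
        for u in candidates[t]:
            # Cheapest predecessor for this candidate.
            j = min(range(len(candidates[t - 1])),
                    key=lambda k: best[t - 1][k][0] + concat_cost(candidates[t - 1][k], u))
            row.append((best[t - 1][j][0] + concat_cost(candidates[t - 1][j], u)
                        + target_cost(t, u), j))
        best.append(row)
    # Trace back the lowest-cost unit sequence.
    i = min(range(len(candidates[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(T - 1, -1, -1):
        path.append(candidates[t][i])
        i = best[t][i][1]
    return list(reversed(path))

# Toy usage: prefer units ending in "1" (target cost) and penalize joining
# units from the same source label (concatenation cost).
cands = [["a1", "a2"], ["b1", "b2"]]
tc = lambda t, u: 0.0 if u.endswith("1") else 1.0
cc = lambda u, v: 0.5 if u[0] == v[0] else 0.0
print(select_units(cands, tc, cc))  # -> ['a1', 'b1']
```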
For listener perception, speech personalization and spontaneity are as important as naturalness. Therefore, an approach to generating personalized spontaneous speech is further proposed. In this method, a target speaker's voice model is first obtained by adapting a pre-trained average voice model. Modulation spectrum-based postfiltering is then applied to further improve the personalization of the synthesized speech and to alleviate its over-smoothing problem. Finally, to generate fluent speech, an algorithm for overlapping and smoothing two consecutive speech segments is proposed to improve the spontaneity of the generated speech. Experimental results show that the proposed method effectively models the target speaker's fluent-transition parameters, including the overlap-length and duration ratios of spontaneous speech, and uses these parameters to generate fluent speech.
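A minimal sketch of the overlap-and-smooth idea is given below: the tail of one synthesized segment is cross-faded into the head of the next over a window set by an overlap ratio. In the proposed method that ratio and the segment durations are modeled from the target speaker's spontaneous speech; here the fixed ratio, the linear cross-fade, and the function name are illustrative assumptions only.

```python
import numpy as np

def overlap_and_smooth(seg_a, seg_b, overlap_ratio=0.2):
    """Cross-fade the tail of seg_a into the head of seg_b over a window
    whose length is overlap_ratio times the shorter segment. A stand-in for
    the fluent-transition model, which predicts the ratio per junction."""
    n = int(overlap_ratio * min(len(seg_a), len(seg_b)))
    if n == 0:
        return np.concatenate([seg_a, seg_b])
    fade = np.linspace(0.0, 1.0, n)
    mixed = seg_a[-n:] * (1.0 - fade) + seg_b[:n] * fade
    return np.concatenate([seg_a[:-n], mixed, seg_b[n:]])

# Example: two 100-sample segments joined with a 20-sample cross-fade.
out = overlap_and_smooth(np.ones(100), np.zeros(100))
print(len(out))  # 180 samples: 80 + 20 overlapped + 80
```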