Personalized Natural-Sounding Speech Synthesis Based on a Small-Sized Corpus
Doctoral dissertation === National Cheng Kung University === Department of Electrical Engineering === 104 === The research on speech synthesis often faces two conflicting issues: either fast and inexpensive system construction with low speech quality using insufficient data, or time-consuming and labor-intensive effort for decent speech quality based on a large database....
Main Authors: | Yan-You Chen, 陳彥佑 |
---|---|
Other Authors: | Jhing-Fa Wang, 王駿發 |
Format: | Others |
Language: | en_US |
Published: | 2016 |
Online Access: | http://ndltd.ncl.edu.tw/handle/7t26rz |
id | ndltd-TW-104NCKU5442057
record_format | oai_dc
spelling | ndltd-TW-104NCKU5442057 2019-05-15T22:54:09Z http://ndltd.ncl.edu.tw/handle/7t26rz Personalized Natural-Sounding Speech Synthesis Based on a Small-Sized Corpus 基於少量語料之個人化自然語音合成 Yan-You Chen 陳彥佑 Doctoral dissertation, National Cheng Kung University, Department of Electrical Engineering, academic year 104. Advisor: Jhing-Fa Wang 王駿發. 2016. Thesis (學位論文), 96 pages, en_US
collection | NDLTD
language | en_US
format | Others
sources | NDLTD
description |
Doctoral dissertation === National Cheng Kung University === Department of Electrical Engineering === 104 === Research on speech synthesis often faces two conflicting issues: either fast and inexpensive system construction with low speech quality using insufficient data, or time-consuming and labor-intensive effort for decent speech quality based on a large database. The main goal of this dissertation is to develop a speech synthesis system that generates personalized, natural-sounding speech from a small-sized corpus, seeking a compromise between data-preparation effort and speech quality.
First, to meet the demand for a precisely segmented corpus for high-quality speech synthesis, this study proposes a speech segmentation algorithm that automatically segments the speech corpus. In this method, articulatory features are first adopted to find candidate segmentation points. A minimum description length (MDL)-based segmentation algorithm then decides the optimal phone boundaries. Finally, the resulting phone boundaries are used to refine the segmentation obtained from Viterbi-based forced alignment, yielding more precise boundaries, especially for spontaneous speech. Experimental results show that the proposed speech segmentation algorithm improves on the Viterbi-based approach.
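The refinement step can be pictured with a minimal sketch: each forced-alignment boundary is snapped to the nearest candidate point produced by the earlier stages, provided it lies within a small tolerance. This is an illustrative assumption of how such refinement might look, not the dissertation's implementation; the `refine_boundaries` name, the 30 ms tolerance, and the use of NumPy are hypothetical.

```python
import numpy as np

def refine_boundaries(forced_boundaries, candidate_points, max_shift=0.03):
    """Snap each forced-alignment phone boundary (in seconds) to the nearest
    candidate segmentation point, but only if it lies within max_shift seconds.
    The candidate points are assumed to come from articulatory-feature change
    detection and MDL-based segmentation (not implemented here)."""
    candidates = np.asarray(sorted(candidate_points))
    refined = []
    for b in forced_boundaries:
        idx = np.searchsorted(candidates, b)
        # The nearest candidate is either the one just below or just above b.
        neighbors = candidates[max(idx - 1, 0): idx + 1]
        nearest = neighbors[np.argmin(np.abs(neighbors - b))]
        refined.append(float(nearest) if abs(nearest - b) <= max_shift else float(b))
    return refined

# Example: two boundaries are nudged to nearby candidates; the third stays put
# because the closest candidate is farther away than the tolerance.
print(refine_boundaries([0.120, 0.310, 0.550], [0.105, 0.298, 0.620]))
# [0.105, 0.298, 0.55]
```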
On the basis of the small corpus, we propose a hybrid speech synthesis technique comprising candidate expansion, two-level unit selection, and prosodic word-level prosody adjustment. In this method, candidate expansion retrieves potential units that would be unlikely to be retrieved using linguistic features alone. The two-level unit selection mechanism then selects the optimal unit sequence from the expanded candidate units by considering both the phone and the prosodic word levels. Prosodic word-level prosody adjustment verifies the prosodic parameters of each syllable in a prosodic word against the statistics of the speech corpus and adjusts the prosody of any syllable that fails verification, using the output of statistical parametric speech synthesis as a reference. Experimental results show that the proposed method generates high-quality, natural-sounding synthesized speech from a small corpus.
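As a rough picture of the unit-selection idea, the sketch below runs a dynamic-programming pass over the expanded candidate units, trading a per-unit target cost against a pairwise concatenation cost. It covers only the phone level; the prosodic word level and the actual cost definitions are not reproduced, and the function names, toy costs, and 0.5 weight are illustrative assumptions.

```python
def select_units(candidates, target_cost, concat_cost, w_concat=0.5):
    """candidates: one list of candidate units per target phone.
    Returns the unit sequence minimizing target + weighted concatenation cost."""
    # best[i][j] = (cumulative cost, back-pointer) for candidate j of phone i.
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            prev = [best[i - 1][k][0] + w_concat * concat_cost(p, u)
                    for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(prev)), key=prev.__getitem__)
            row.append((prev[k_best] + target_cost(i, u), k_best))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy usage: units are (phone, pitch) pairs; costs are simple mismatches.
cands = [[("a", 1), ("a", 3)], [("b", 2), ("b", 5)]]
tc = lambda i, u: abs(u[1] - 2)      # prefer pitch near 2
cc = lambda p, u: abs(p[1] - u[1])   # prefer smooth pitch joins
print(select_units(cands, tc, cc))   # [('a', 1), ('b', 2)]
```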
From the listener's perspective, speech personalization and spontaneity are as important as naturalness. Therefore, an approach to generating personalized spontaneous speech is further proposed. In this method, a target speaker's voice model is first obtained by adapting an average voice model trained in advance. Modulation spectrum-based postfiltering is applied to further improve personalization and to alleviate the over-smoothing problem of the synthesized speech. Then, to generate fluent speech, an algorithm that overlaps and smooths two consecutive speech segments is proposed to improve the spontaneity of the generated speech. Experimental results show that the proposed method can effectively model the target speaker's fluent-transition parameters, including the ratios of overlap length and the duration of spontaneous speech, and use these parameters to generate fluent speech.
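The overlap-and-smooth idea for joining two consecutive segments can be sketched as a simple cross-fade. The linear fade shape and the 20% overlap ratio below are illustrative assumptions; in the proposed method such transition parameters are modeled from the target speaker's spontaneous speech rather than fixed.

```python
import numpy as np

def overlap_and_smooth(seg_a, seg_b, overlap_ratio=0.2):
    """Join two waveform segments by overlapping the tail of seg_a with the
    head of seg_b and cross-fading linearly over the overlap region."""
    n = int(min(len(seg_a), len(seg_b)) * overlap_ratio)
    if n == 0:
        return np.concatenate([seg_a, seg_b])
    fade = np.linspace(0.0, 1.0, n)
    blended = seg_a[-n:] * (1.0 - fade) + seg_b[:n] * fade
    return np.concatenate([seg_a[:-n], blended, seg_b[n:]])

# Example: two 0.5 s sine bursts at 16 kHz joined with a 20% overlap;
# the result is shorter than plain concatenation by the overlap length.
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
a = np.sin(2 * np.pi * 220 * t)
b = np.sin(2 * np.pi * 330 * t)
print(overlap_and_smooth(a, b).shape)  # (14400,)
```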
author2 | Jhing-Fa Wang
author | Yan-You Chen 陳彥佑
title | Personalized Natural-Sounding Speech Synthesis Based on a Small-Sized Corpus
publishDate | 2016
url | http://ndltd.ncl.edu.tw/handle/7t26rz