Personalized Natural-Sounding Speech Synthesis Based on a Small-Sized Corpus

Doctoral === National Cheng Kung University === Department of Electrical Engineering === 104 === Research on speech synthesis often faces a trade-off between two conflicting options: fast, inexpensive system construction that yields low speech quality because the data are insufficient, or time-consuming, labor-intensive work that yields decent speech quality from a large database....

Bibliographic Details
Main Authors: Yan-You Chen, 陳彥佑
Other Authors: Jhing-Fa Wang
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/7t26rz
id ndltd-TW-104NCKU5442057
record_format oai_dc
spelling ndltd-TW-104NCKU5442057 2019-05-15T22:54:09Z http://ndltd.ncl.edu.tw/handle/7t26rz Personalized Natural-Sounding Speech Synthesis Based on a Small-Sized Corpus 基於少量語料之個人化自然語音合成 Yan-You Chen 陳彥佑 Doctoral National Cheng Kung University Department of Electrical Engineering 104 Jhing-Fa Wang 王駿發 2016 thesis 96 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral === National Cheng Kung University === Department of Electrical Engineering === 104 === Research on speech synthesis often faces a trade-off between two conflicting options: fast, inexpensive system construction that yields low speech quality because the data are insufficient, or time-consuming, labor-intensive work that yields decent speech quality from a large database. The main goal of this dissertation is to develop a speech synthesis system that generates personalized, natural-sounding speech from a small corpus, striking a compromise between data preparation effort and speech quality.
First, because high-quality speech synthesis requires a corpus with precise speech segmentation, this study proposes an algorithm that segments the speech corpus automatically. Articulatory features are first used to find candidate segmentation points. A minimum description length based segmentation algorithm then decides the optimal phone boundaries. Finally, these phone boundaries are used to refine the segmentation obtained from Viterbi-based forced alignment, giving more precise segmentation, especially for spontaneous speech. Experimental results show that the proposed algorithm improves on the result of the Viterbi-based approach.
Second, for a small corpus, a hybrid speech synthesis technique is proposed that comprises candidate expansion, two-level unit selection, and prosodic word-level prosody adjustment. Candidate expansion retrieves potential units that are unlikely to be retrieved using linguistic features alone. The two-level unit selection mechanism selects the optimal unit sequence from the expanded candidates by considering both the phone level and the prosodic word level. Prosodic word-level prosody adjustment verifies the prosodic parameters of each syllable in a prosodic word against the statistics of the speech corpus, and adjusts the prosody of any syllable that fails verification based on the output of statistical parametric speech synthesis. Experimental results show that the proposed method generates high-quality, natural synthesized speech from a small corpus.
Third, because speech personalization and spontaneity matter as much to listeners as naturalness, an approach to generating personalized spontaneous speech is proposed. A target speaker's voice model is first obtained by adapting an average voice model trained in advance. Modulation spectrum-based postfiltering is then used to further improve personalization and to alleviate the over-smoothing problem of the synthesized speech. To generate fluent speech, an algorithm that overlaps and smooths two consecutive speech segments is proposed to improve the spontaneity of the generated speech. Experimental results show that the proposed method can effectively model the target speaker's fluent-transition parameters, including the ratios of overlap length and duration in spontaneous speech, and can use these parameters to generate fluent speech.
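The overlap-and-smooth step mentioned in the description can be illustrated with a minimal sketch. This record does not give the thesis's actual algorithm, so a plain linear cross-fade is swapped in as a stand-in. The function name `overlap_and_smooth` and the `overlap_ratio` parameter (overlap length as a fraction of the shorter segment) are hypothetical and chosen only for illustration.

```python
def overlap_and_smooth(left, right, overlap_ratio):
    """Join two sample sequences with a linear cross-fade.

    Assumption (not from the thesis): overlap_ratio is the overlap
    length expressed as a fraction of the shorter segment, in [0, 1].
    """
    if not 0.0 <= overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must be in [0, 1]")

    # Number of samples shared by the end of `left` and the start of `right`.
    n = int(overlap_ratio * min(len(left), len(right)))
    if n == 0:
        return list(left) + list(right)

    head = list(left[:len(left) - n])   # part of `left` before the overlap
    tail = list(right[n:])              # part of `right` after the overlap

    faded = []
    for i in range(n):
        # Weight of the right segment rises linearly across the overlap.
        w = (i + 1) / (n + 1)
        faded.append((1.0 - w) * left[len(left) - n + i] + w * right[i])

    return head + faded + tail


# Usage: two four-sample segments joined with a 50% overlap.
joined = overlap_and_smooth([1.0] * 4, [0.0] * 4, 0.5)
```

The joined sequence is shorter than the two inputs combined by the overlap length, so in this sketch a larger ratio both shortens the total duration and widens the transition.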
author2 Jhing-Fa Wang
author Yan-You Chen
陳彥佑