Automatic Generation of Synthesis Units for Taiwanese Text-to-Speech System


Bibliographic Details
Main Authors: Zhen-Hong Fu, 傅振宏
Other Authors: Ren-yuan Lyu
Format: Others
Language: zh-TW
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/46706238089789082381
Description
Summary: Master's thesis === Chang Gung University === Graduate Institute of Electrical Engineering === 88 === In this thesis, we demonstrate a Taiwanese (Min-nan) text-to-speech (TTS) system based on automatically generated synthesis units. It can read aloud modern Taiwanese articles rather naturally. The TTS system is composed of three functional modules: a text analysis module, a prosody module, and a waveform synthesis module.

Modern Taiwanese texts contain Chinese characters and English letters simultaneously, so the text analysis module must first be able to handle Chinese-English mixed text. In this module, text normalization, word segmentation, letter-to-phoneme conversion, and word frequencies are used to resolve multiple pronunciations. The prosody module handles tone sandhi and phonetic variation in Taiwanese.

The synthesis units in the waveform synthesis module come from two sources: (1) isolated-utterance tonal syllables covering all possible tonal variations in Taiwanese, about 4521 in total, and (2) units automatically generated from a designated speech corpus. We employ an HMM-based large-vocabulary Taiwanese speech recognition system to perform forced alignment on the speech corpus; short-pause recognition is incorporated into the recognizer. After the string of synthesis units has been extracted, inter-syllable coarticulation information is applied to decide how to concatenate the units. After energy normalization, the output speech is generated.

We evaluate our system on automatically segmented speech. Compared with human segmentation, a correct rate of about 85% is achieved. The system has been implemented on a PC running MS Windows 9x/NT/2000.
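The abstract mentions that the text analysis module uses word segmentation, among other steps, to resolve multiple pronunciations. The thesis does not specify the segmentation algorithm; a minimal illustrative sketch (assuming a simple lexicon and greedy forward longest-match, with the hypothetical function name `segment`) might look like this:

```python
def segment(text, lexicon):
    """Greedy forward longest-match word segmentation.

    Illustrative only: the actual system also uses word frequencies
    and handles mixed Chinese-English input, which this sketch omits.
    """
    if not lexicon:
        return list(text)
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in lexicon:
                words.append(cand)
                i += length
                break
    return words
```

In a real system the lexicon would contain Taiwanese words, and frequency information would be used to break ties between competing segmentations.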
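The prosody module handles Taiwanese tone sandhi, in which every syllable except the last in a prosodic phrase changes tone according to a largely circular rule system. The sketch below is a highly simplified illustration of that idea (the tone mapping shown is the commonly cited pattern for unchecked tones; real rules vary by dialect and include checked syllables, which the thesis's module would have to cover):

```python
# Simplified sandhi mapping for unchecked tones (illustrative;
# dialect-dependent and incomplete -- checked tones 4 and 8 omitted).
SANDHI = {1: 7, 7: 3, 3: 2, 2: 1, 5: 7}

def apply_tone_sandhi(syllable_tones):
    """Apply the sandhi tone to every syllable except the phrase-final one."""
    if not syllable_tones:
        return []
    changed = [SANDHI.get(t, t) for t in syllable_tones[:-1]]
    return changed + [syllable_tones[-1]]
```

Because only the phrase-final syllable keeps its citation tone, correct phrase boundary detection (which interacts with the word segmentation stage) is essential for natural output.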
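The waveform synthesis stage concatenates the selected units and applies energy normalization before output. The thesis does not give the normalization formula; one common choice, shown here as an assumption, is to scale each unit to a common RMS level before concatenation:

```python
import numpy as np

def rms(x):
    """Root-mean-square energy of a waveform segment."""
    return float(np.sqrt(np.mean(np.square(x))))

def concatenate_units(units, target_rms=0.1):
    """Scale each unit to a common RMS level, then concatenate.

    A sketch of energy normalization; the actual system additionally
    uses inter-syllable coarticulation information to choose and join
    units, which is not modeled here.
    """
    scaled = []
    for u in units:
        e = rms(u)
        scaled.append(u * (target_rms / e) if e > 0 else u)
    return np.concatenate(scaled)
```

Without such normalization, units extracted from different corpus utterances can differ sharply in loudness, producing audible jumps at concatenation points.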