Summary: | Master's === National Chiao Tung University === Institute of Communications Engineering === 100 === This thesis establishes an online English text-to-speech system. The speech database consists of TOEFL articles read by a female speaker whose native language is Chinese. The database is first segmented with a well-trained triphone model. The CMU dictionary and the Stanford POS Tagger are then used to label the relative positions and prosodic information at five structural levels: phone, syllable, word, phrase, and sentence. From these labels, vocal tract, fundamental frequency, and duration models are built, with the aim of producing richer prosody and rhythm.
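As a rough illustration of the five-level positional labels described above, the following Python sketch computes, for each phone, its forward and backward position within its syllable, word, phrase, and sentence. It assumes the sentence has already been split into phrases, words, syllables, and phones (for example, with phones taken from the CMU dictionary). The nested-list input format and the names `PhoneContext` and `relative_positions` are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass

# Hypothetical input format: sentence -> phrases -> words -> syllables -> phones.
Sentence = list[list[list[list[str]]]]


@dataclass
class PhoneContext:
    phone: str
    pos_in_syllable: tuple[int, int]     # (forward, backward), 1-based
    syl_in_word: tuple[int, int]
    word_in_phrase: tuple[int, int]
    phrase_in_sentence: tuple[int, int]


def fb(i: int, n: int) -> tuple[int, int]:
    """Return the 1-based forward and backward position of index i among n items."""
    return (i + 1, n - i)


def relative_positions(sentence: Sentence) -> list[PhoneContext]:
    """Attach positional context at every level of the hierarchy to each phone."""
    contexts = []
    for pi, phrase in enumerate(sentence):
        for wi, word in enumerate(phrase):
            for si, syllable in enumerate(word):
                for ki, phone in enumerate(syllable):
                    contexts.append(PhoneContext(
                        phone=phone,
                        pos_in_syllable=fb(ki, len(syllable)),
                        syl_in_word=fb(si, len(word)),
                        word_in_phrase=fb(wi, len(phrase)),
                        phrase_in_sentence=fb(pi, len(sentence)),
                    ))
    return contexts
```

For example, "hello world" as a single phrase, with an assumed syllabification of the CMU pronunciations `HH AH0 L OW1` and `W ER1 L D`, would be passed as `[[[["HH", "AH0"], ["L", "OW1"]], [["W", "ER1", "L", "D"]]]]`. In a full system these positional features would be combined with the prosodic labels (such as part of speech) to form the context used by the vocal tract, fundamental frequency, and duration models.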
According to the experimental results, the synthesized prosody is still not natural enough. Compared with speech synthesized by a foreign website, our prosody shows more variation, but it also sounds more blurred and contains unnatural rises and falls. We suspect that the rule-based method used to estimate the various prosodic labels is not yet accurate enough. As a result, the prosody of the synthesized speech is broadly correct but shows strange fluctuations in detail.
|