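Summary: | Master's === National Taiwan University of Science and Technology === Department of Computer Science and Information Engineering === 105 === In this thesis, normalization methods for syllable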
initial and final durations are studied. Also, a feature set
is designed for Weka to construct classification and regression
trees (CART) to predict the syllable initial and final
durations of a text sentence to be synthesized. We hope to
combine the two studies (duration normalization and duration
prediction using CART) to increase the naturalness level
of the synthesized speech, especially in the arrangement of
initial and final durations. In the training stage, the original
durations of syllable initials and finals are obtained by reading
the corresponding label file of each training sentence. Then, the
two-level standard deviation matching method proposed here
is used to normalize the durations of syllable initials and
finals. Next, Weka is used to construct two CARTs, one for
the durations of syllable initials and one for the durations of
finals. In the synthesis stage, we develop program
modules to predict the duration of a syllable initial or final
according to the two CARTs constructed by Weka. These
program modules are then integrated into the speech synthesis system
developed by previous researchers. Hence, the system can
synthesize speech signals according to the duration
normalization and prediction methods studied in this thesis.
Using the synthesized speech, we conduct two types of
listening tests: a naturalness comparison test and a
naturalness MOS (mean opinion score) evaluation. According to
the average scores obtained from the naturalness comparison
test, the duration prediction method studied here is
indeed rated better than the method provided by previous
researchers. This is because the arrangement of syllable
initial and final durations by our method is more natural. In
addition, according to the average scores obtained from the
naturalness MOS evaluation, most
participants agree that the speech synthesized with our
duration prediction method is very close to the corresponding
speech uttered by a real speaker. In detail, the average
scores of our synthetic speech samples are all greater than 3.5 points,
and one of them is greater than 4 points. Therefore, the
naturalness level of the speech synthesized with our
duration normalization and prediction methods is very close to
that of speech uttered by a real person.
|