Summary: | Master's === National Cheng Kung University (國立成功大學) === Department of Electrical Engineering, Master's and Doctoral Program === 97 === Technology always comes from human nature. Multimedia interactive applications are increasingly popular in daily life, yet they still leave considerable room for improvement. Our ongoing goal is to improve multimedia interactive technology and make it more convenient for people.
In this thesis, we propose a real-time, voice-driven talking-face technique for digital home communication systems. Each speech segment is first pre-emphasized and Hamming-windowed. Twelfth-order linear predictive cepstral coefficients (LPCCs) are then extracted as the feature vector for the segment, and support vector machines (SVMs) recognize the Chinese phonetic symbols.
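The feature-extraction steps described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the pre-emphasis coefficient 0.97, the use of the autocorrelation method with the Levinson-Durbin recursion, and the function names are all assumptions; only the Hamming window and the 12 LPCCs per segment come from the text.

```python
import numpy as np

def preemphasize(x, coeff=0.97):
    """First-order high-pass filter y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - coeff * x[:-1])

def lpc(frame, order):
    """LPC polynomial A(z) = 1 + sum a[k] z^-k via autocorrelation + Levinson-Durbin."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0:  # degenerate (e.g. silent) frame: stop early
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpcc(a, n_ceps):
    """Cepstral coefficients c[1..n_ceps] of the all-pole model 1/A(z)."""
    alpha = -np.asarray(a, dtype=float)[1:]  # predictor coefficients
    p = len(alpha)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = alpha[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * alpha[n - k - 1]
        c[n] = acc
    return c[1:]

def frame_features(frame, order=12, coeff=0.97):
    """Pre-emphasis, Hamming window, then 12 LPCCs for one speech segment."""
    emphasized = preemphasize(frame, coeff)
    windowed = emphasized * np.hamming(len(emphasized))
    return lpcc(lpc(windowed, order), order)
```

Each segment then yields a 12-dimensional vector that could be passed to an SVM classifier; the classifier itself is outside this sketch.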
The mouth-shape pictures of the 16 Chinese single vowels can be clustered into several groups based on shape similarity. Because every speaker has an individual accent and speaking habits, we use the sum of absolute differences (SAD) as the shape-difference measure to cluster each user's mouth shapes into several categories. Since the categories are derived per user, they fit that user's speech characteristics best, which improves the recognition rate and performance.
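As an illustration of SAD-based grouping, the sketch below assigns each mouth-shape picture to the first existing group whose representative lies within a SAD threshold, and otherwise starts a new group. The greedy threshold rule, the threshold value, and the function names are assumptions made for this example; the abstract specifies only that SAD is the difference measure.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized grayscale images."""
    # Cast to a wide signed type so uint8 subtraction cannot wrap around.
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def cluster_by_sad(images, threshold):
    """Greedy grouping (assumed rule): join the first group whose representative
    is within `threshold`, otherwise open a new group."""
    representatives = []  # index of the first image in each group
    labels = []
    for i, img in enumerate(images):
        for group, rep in enumerate(representatives):
            if sad(img, images[rep]) <= threshold:
                labels.append(group)
                break
        else:
            representatives.append(i)
            labels.append(len(representatives) - 1)
    return labels
```

Running this once per user gives a user-specific mapping from vowels to mouth-shape categories, which matches the per-user clustering idea described above.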
Finally, we use alpha blending, which mixes the pixels of the source and destination pictures by adjusting a picture's transparency level, to smooth the transition between two successive pictures. Experimental results show a phoneme error rate (PER) of 19.22%. After phoneme clustering, the PER is reduced to 8.78%, and the word error rate (WER) is 27.65%. The mean opinion score (MOS) for single-word recognition, delay, and naturalness of the whole system averages 3.43 points.
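The alpha-blending step can be sketched as a per-pixel weighted average of the source and destination pictures. The linear ramp of the blend weight, the number of in-between frames, and the function names are assumptions for illustration only.

```python
import numpy as np

def alpha_blend(src, dst, alpha):
    """Blend two uint8 images: alpha=0 returns src, alpha=1 returns dst."""
    mixed = (1.0 - alpha) * src.astype(float) + alpha * dst.astype(float)
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

def transition_frames(src, dst, steps):
    """Return steps + 1 frames fading linearly from src to dst (assumed schedule)."""
    return [alpha_blend(src, dst, t / steps) for t in range(steps + 1)]
```

Showing the intermediate frames between two mouth-shape pictures avoids an abrupt jump from one shape to the next.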
|