Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition

Master's thesis === National Cheng Kung University === Department of Electrical Engineering, MS/PhD Program === Academic year 97


Bibliographic Details
Main Authors: Zong-You Chen, 陳宗佑
Other Authors: Jhing-Fa Wang
Format: Others
Language: en_US
Published: 2009
Online Access:http://ndltd.ncl.edu.tw/handle/21609132787137070832
Description: Technology always comes from human nature. Although multimedia interactive applications are increasingly popular in daily life, there is still great room for improvement, and improving multimedia interaction technology to bring people more convenience is a goal we continuously pursue. In this thesis, we propose a real-time voice-driven talking-face technology for a digital home communication system.

For each speech segment, we first apply pre-emphasis and Hamming windowing. Twelfth-order linear predictive cepstral coefficients (LPCCs) are then extracted as the speech feature vector for that segment. Chinese phonetic symbol recognition is performed with support vector machines (SVMs). The mouth-shape pictures of the 16 Chinese single vowels can be clustered into several groups according to the similarity of the shapes. Because every person has an individual accent and speaking habits, we use the sum of absolute differences (SAD) as a shape-difference measure to cluster each user's mouth shapes into several categories. Since the categories adopted for each user best fit that user's personal speech characteristics, the recognition rate and overall performance are enhanced. Finally, we use alpha blending to blend the pixels of the source and destination pictures by adjusting a picture's transparency level, which smooths the transition between two successive pictures.

Experimental results show a Phoneme Error Rate (PER) of 19.22%; after phoneme clustering, the PER is reduced to 8.78% and the Word Error Rate (WER) is 27.65%. The mean opinion score (MOS) for single-word recognition, delay, and naturalness of the whole system averages 3.43 points.
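The thesis gives no code for the speech front end; the following is a minimal numpy sketch of the described pipeline: pre-emphasis, Hamming windowing, autocorrelation, Levinson-Durbin LPC analysis, and the standard LPC-to-cepstrum recursion. The order (12) matches the thesis's 12-order LPCCs; the 0.97 pre-emphasis coefficient and all other details are illustrative assumptions, not the thesis's exact settings.

```python
import numpy as np

def lpcc(frame, order=12, pre_emph=0.97):
    """Extract LPCCs from one speech frame (a 1-D float array)."""
    # Pre-emphasis: s[n] = x[n] - pre_emph * x[n-1]
    s = np.append(frame[0], frame[1:] - pre_emph * frame[:-1])
    # Hamming window
    s = s * np.hamming(len(s))
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    # Levinson-Durbin recursion -> LPC coefficients a[1..order]
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e = max(e * (1.0 - k * k), 1e-12)  # guard against numerical blow-up
    # LPC -> cepstrum: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(order + 1)
    for n in range(1, order + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]
```

A 12-dimensional vector like this would be computed per segment and fed to the classifier.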
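The thesis classifies Chinese phonetic symbols with SVMs but does not specify the implementation. As a stand-in sketch, here is a binary linear SVM trained with Pegasos-style stochastic subgradient descent on toy 12-dimensional "LPCC-like" vectors; the data, parameters, and binary setup are all hypothetical (the actual system would need multi-class SVMs on real LPCC features).

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM via Pegasos-style stochastic subgradient descent.
    Labels y must be in {-1, +1}; returns weights w with the bias folded in."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append constant 1 for the bias
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            if y[i] * (w @ Xb[i]) < 1:   # hinge-loss margin violated
                w = (1 - eta * lam) * w + eta * y[i] * Xb[i]
            else:                        # only the regularizer contributes
                w = (1 - eta * lam) * w
    return w

def predict(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ w)

# Toy stand-in for two phoneme classes in a 12-dim feature space
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 12)) + 2.0,
               rng.standard_normal((50, 12)) - 2.0])
y = np.array([1] * 50 + [-1] * 50)
w = train_linear_svm(X, y)
acc = np.mean(predict(w, X) == y)
```

A full 16-vowel recognizer would typically combine such binary machines one-vs-rest or one-vs-one.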
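For the personalized mouth-shape categories, the thesis uses SAD as the shape-difference measure but does not state the clustering algorithm. The greedy threshold scheme below is one simple possibility, not the thesis's method; the threshold value is a free parameter.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two same-sized grayscale images."""
    return np.abs(a.astype(np.int64) - b.astype(np.int64)).sum()

def cluster_mouth_shapes(images, threshold):
    """Greedy clustering: assign each image to the first existing cluster whose
    representative is within `threshold` SAD; otherwise start a new cluster."""
    reps, labels = [], []
    for img in images:
        for i, rep in enumerate(reps):
            if sad(img, rep) <= threshold:
                labels.append(i)
                break
        else:
            reps.append(img)
            labels.append(len(reps) - 1)
    return labels
```

Run once per user, this groups the 16 vowel mouth shapes into however many visually distinct categories that speaker actually produces.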
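The final smoothing step, alpha blending between two successive mouth pictures, can be sketched as a simple per-pixel cross-fade (a minimal illustration, assuming 8-bit grayscale frames):

```python
import numpy as np

def alpha_blend(src, dst, alpha):
    """Per-pixel blend of two images: alpha=0 gives src, alpha=1 gives dst."""
    return ((1.0 - alpha) * src + alpha * dst).astype(np.uint8)

def crossfade(src, dst, steps):
    """Generate `steps` frames that fade from src to dst by ramping alpha."""
    return [alpha_blend(src, dst, t / (steps - 1)) for t in range(steps)]
```

Inserting a few such intermediate frames between consecutive mouth shapes is what makes the transition look smooth rather than abrupt.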