Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition
Master's thesis === National Cheng Kung University === Department of Electrical Engineering (MS/PhD Program) === Academic Year 97
Main Authors: | Zong-You Chen 陳宗佑 |
---|---|
Other Authors: | Jhing-Fa Wang 王駿發 |
Format: | Others |
Language: | en_US |
Published: | 2009 |
Online Access: | http://ndltd.ncl.edu.tw/handle/21609132787137070832 |
id |
ndltd-TW-097NCKU5442079 |
record_format |
oai_dc |
spelling |
ndltd-TW-097NCKU5442079 2016-05-04T04:17:07Z http://ndltd.ncl.edu.tw/handle/21609132787137070832 Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition 基於聲韻辨識之互動式即時語音驅動人臉系統 Zong-You Chen 陳宗佑 Master's thesis, National Cheng Kung University, Department of Electrical Engineering, Academic Year 97. Advisor: Jhing-Fa Wang 王駿發. 2009. 學位論文 (thesis), 51 pp., en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Cheng Kung University === Department of Electrical Engineering (MS/PhD Program) === Academic Year 97 === Technology always comes from human nature. Multimedia interactive applications are increasingly popular in daily life, yet there remains great room for improvement. Improving multimedia interaction technology and bringing more convenience to people is a goal we continuously pursue.
In this thesis, we propose a real-time voice-driven talking-face technology for a digital-home communication system. For each speech segment, we first perform pre-emphasis and Hamming windowing. The 12th-order linear predictive cepstral coefficients (LPCCs) are then extracted as the speech feature vector for that segment. Chinese phonetic symbol recognition is performed with support vector machines (SVMs).
The mouth-shape pictures of the 16 Chinese single vowels can be clustered into several groups based on shape similarity. Because every person has his or her own accent and speaking habits, we use the sum of absolute differences (SAD) as a shape-difference measure to cluster each user's mouth shapes into several categories. Since the categories adopted for each user best fit that user's personal speech characteristics, the recognition rate and overall performance are enhanced.
Finally, we use alpha blending to blend the pixels of the source and destination pictures by adjusting the transparency level of a picture, which improves the smoothness between two successive pictures. Experimental results show a phoneme error rate (PER) of 19.22%. After phoneme clustering, the PER is reduced to 8.78%, and the word error rate (WER) is 27.65%. The mean opinion score (MOS) for single-word recognition, delay, and naturalness of the whole system averages 3.43 points.
|
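The speech front end described in the abstract (pre-emphasis, Hamming windowing, 12th-order LPCC extraction) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the pre-emphasis coefficient 0.97 and the autocorrelation method with Levinson-Durbin recursion are common textbook choices assumed here, not details stated in the abstract.

```python
import numpy as np

def lpcc(frame, order=12, preemph=0.97):
    """Extract `order` linear predictive cepstral coefficients from one frame."""
    # Pre-emphasis boosts high frequencies before analysis
    x = np.append(frame[0], frame[1:] - preemph * frame[:-1])
    # Hamming window reduces spectral leakage at the frame edges
    x = x * np.hamming(len(x))
    # Autocorrelation r[0..order]
    full = np.correlate(x, x, mode="full")
    r = full[len(x) - 1 : len(x) + order]
    # Levinson-Durbin recursion -> LPC predictor coefficients a[1..order]
    a = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    # LPC -> cepstral coefficients via the standard recursion
    c = np.zeros(order)
    for n in range(1, order + 1):
        c[n - 1] = a[n] + sum((m / n) * c[m - 1] * a[n - m] for m in range(1, n))
    return c
```

The resulting 12-dimensional vector would then serve as the per-segment feature fed to the SVM classifier.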
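The SAD-based grouping of a user's mouth shapes described in the abstract can be sketched with a simple greedy clustering over grayscale images. The threshold value and the greedy assignment strategy are illustrative assumptions; the thesis's actual clustering procedure may differ.

```python
import numpy as np

def sad(img_a, img_b):
    # Sum of absolute differences between two equally sized grayscale images
    return np.abs(img_a.astype(int) - img_b.astype(int)).sum()

def cluster_mouth_shapes(images, threshold):
    """Greedy clustering: assign each image to the first cluster whose
    representative is within `threshold` SAD, else start a new cluster."""
    clusters = []  # list of (representative image, member indices)
    for idx, img in enumerate(images):
        for rep, members in clusters:
            if sad(img, rep) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append((img, [idx]))
    return clusters
```

Run once per user over the 16 single-vowel mouth pictures, this yields user-specific categories that reflect personal articulation habits, which is the motivation given in the abstract.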
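The alpha-blending transition between two successive mouth pictures can be sketched as a linear pixel blend. The linear alpha schedule and the number of intermediate frames are illustrative assumptions, not values from the thesis.

```python
import numpy as np

def alpha_blend(src, dst, alpha):
    # Linear blend: alpha = 0 shows only src, alpha = 1 shows only dst
    return ((1.0 - alpha) * src + alpha * dst).astype(np.uint8)

def transition_frames(src, dst, steps=5):
    # Intermediate frames inserted between two successive mouth-shape pictures,
    # smoothing the jump from one recognized mouth shape to the next
    return [alpha_blend(src, dst, (i + 1) / (steps + 1)) for i in range(steps)]
```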
author2 |
Jhing-Fa Wang |
author_facet |
Jhing-Fa Wang Zong-You Chen 陳宗佑 |
author |
Zong-You Chen 陳宗佑 |
spellingShingle |
Zong-You Chen 陳宗佑 Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
author_sort |
Zong-You Chen |
title |
Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
title_short |
Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
title_full |
Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
title_fullStr |
Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
title_full_unstemmed |
Interactive Real-Time Voice-Driven Human Talking Face System Based on Phonetic Recognition |
title_sort |
interactive real-time voice-driven human talking face system based on phonetic recognition |
publishDate |
2009 |
url |
http://ndltd.ncl.edu.tw/handle/21609132787137070832 |
work_keys_str_mv |
AT zongyouchen interactiverealtimevoicedrivenhumantalkingfacesystembasedonphoneticrecognition AT chénzōngyòu interactiverealtimevoicedrivenhumantalkingfacesystembasedonphoneticrecognition AT zongyouchen jīyúshēngyùnbiànshízhīhùdòngshìjíshíyǔyīnqūdòngrénliǎnxìtǒng AT chénzōngyòu jīyúshēngyùnbiànshízhīhùdòngshìjíshíyǔyīnqūdòngrénliǎnxìtǒng |
_version_ |
1718256227480240128 |