A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition
Ph.D. === Tatung University (大同大學) === Department of Computer Science and Engineering === 97 === In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of these audio-visual speech recognizers is to improve recognition accuracy, particularly in difficult...
Main Authors: | Wen-Yuan Liao, 廖文淵 |
---|---|
Other Authors: | Tsang-Long Pao |
Format: | Others |
Language: | en_US |
Published: | 2009 |
Online Access: | http://ndltd.ncl.edu.tw/handle/46704732964354703864 |
id |
ndltd-TW-097TTU05392029 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-097TTU05392029 2016-05-02T04:11:10Z http://ndltd.ncl.edu.tw/handle/46704732964354703864 A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition 聽視覺特徵擷取在中文數字語音辨識之研究 Wen-Yuan Liao 廖文淵 Ph.D. Tatung University (大同大學) Department of Computer Science and Engineering 97 In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of these audio-visual speech recognizers is to improve recognition accuracy, particularly under difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves a better recognition rate in noisy conditions as well as for speech spoken with emotion. We first extract the visual features of the lips, including geometric and motion features. These features are very important to the recognition system, especially in noisy conditions or with emotional effects. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor. We compare the system's performance when using motion features and when using geometric features. For this recognition system, we propose using the weighted-discrete KNN (WD-KNN) as the classifier, compare it with two popular classifiers, the GMM and the HMM, and evaluate their performance on a Mandarin audio-visual speech corpus. We find that the WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic property of Mandarin and because it is computationally inexpensive. The experimental results of the different classifiers at various SNR levels are presented. The results show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions were also studied for the weighted-KNN-based classifier: linear distance weighting, inverse distance weighting, rank weighting, and reverse Fibonacci weighting. Overall, the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the three extended versions of KNN. Finally, we perform emotional speech recognition experiments. The results show that the system is more robust when visual information is included: the audio-visual speech recognition system achieves a higher recognition rate when visual cues are incorporated. Tsang-Long Pao 包蒼龍 2009 thesis 106 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Ph.D. === Tatung University (大同大學) === Department of Computer Science and Engineering === 97 === In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of these audio-visual speech recognizers is to improve recognition accuracy, particularly under difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves a better recognition rate in noisy conditions as well as for speech spoken with emotion.
We first extract the visual features of the lips, including geometric and motion features. These features are very important to the recognition system, especially in noisy conditions or with emotional effects. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor. We compare the system's performance when using motion features and when using geometric features. For this recognition system, we propose using the weighted-discrete KNN (WD-KNN) as the classifier, compare it with two popular classifiers, the GMM and the HMM, and evaluate their performance on a Mandarin audio-visual speech corpus. We find that the WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic property of Mandarin and because it is computationally inexpensive.
The experimental results of the different classifiers at various SNR levels are presented. The results show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions were also studied for the weighted-KNN-based classifier: linear distance weighting, inverse distance weighting, rank weighting, and reverse Fibonacci weighting. Overall, the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the three extended versions of KNN.
Finally, we perform emotional speech recognition experiments. The results show that the system is more robust when visual information is included: the audio-visual speech recognition system achieves a higher recognition rate when visual cues are incorporated.
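The abstract names four weighting functions for a weighted KNN classifier but this record does not define them, nor the exact WD-KNN decision rule. The following is a minimal sketch of a generic weighted-vote KNN, assuming common textbook forms for each weighting; the function names, the Euclidean distance, and the voting rule are illustrative assumptions, not the thesis's implementation.

```python
import math
from collections import defaultdict


def reverse_fibonacci_weights(k):
    """First k Fibonacci numbers (1, 1, 2, 3, ...) in descending order (assumed form)."""
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return fib[:k][::-1]


def rank_weights(k):
    """Nearest neighbor gets weight k, the farthest gets 1 (assumed form)."""
    return [k - i for i in range(k)]


def inverse_distance_weights(dists, eps=1e-9):
    """Weight 1/d; eps avoids division by zero for exact matches (assumed form)."""
    return [1.0 / (d + eps) for d in dists]


def linear_distance_weights(dists):
    """Linear ramp from 1 (nearest) down to 0 (farthest) over sorted distances (assumed form)."""
    nearest, farthest = dists[0], dists[-1]
    if farthest == nearest:
        return [1.0] * len(dists)
    return [(farthest - d) / (farthest - nearest) for d in dists]


def weighted_knn_predict(train_x, train_y, query, k=5, scheme="reverse_fibonacci"):
    """Hypothetical helper: label with the largest summed weight among the k nearest points."""
    # Sort training points by Euclidean distance to the query and keep the k nearest.
    ranked = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    dists = [d for d, _ in ranked]
    n = len(ranked)
    if scheme == "reverse_fibonacci":
        weights = reverse_fibonacci_weights(n)
    elif scheme == "rank":
        weights = rank_weights(n)
    elif scheme == "inverse_distance":
        weights = inverse_distance_weights(dists)
    elif scheme == "linear_distance":
        weights = linear_distance_weights(dists)
    else:
        raise ValueError(f"unknown weighting scheme: {scheme}")
    # Accumulate each neighbor's weight under its class label.
    votes = defaultdict(float)
    for (_, label), w in zip(ranked, weights):
        votes[label] += w
    return max(votes, key=votes.get)


# Toy data (made up for illustration): two points of class "a", three of class "b".
train_x = [(0, 0), (0, 1), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "b", "b", "b"]
print(weighted_knn_predict(train_x, train_y, (1, 1), k=5))
```

With k=5 on this toy data, an unweighted majority vote would favor "b" (three neighbors against two), whereas every weighting above favors the two much closer "a" points. This illustrates why the choice of weighting function can change the result.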
|
author2 |
Tsang-Long Pao |
author_facet |
Tsang-Long Pao Wen-Yuan Liao 廖文淵 |
author |
Wen-Yuan Liao 廖文淵 |
spellingShingle |
Wen-Yuan Liao 廖文淵 A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition |
author_sort |
Wen-Yuan Liao |
title |
A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition |
publishDate |
2009 |
url |
http://ndltd.ncl.edu.tw/handle/46704732964354703864 |
_version_ |
1718253431662051328 |