A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition

Doctoral === Tatung University === Department of Computer Science and Engineering === 97 === In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of these audio-visual speech recognizers is to improve recognition accuracy, particularly in difficult...

Bibliographic Details
Main Authors:	Wen-Yuan Liao, 廖文淵
Other Authors:	Tsang-Long Pao
Format:	Others
Language:	en_US
Published:	2009
Online Access:	http://ndltd.ncl.edu.tw/handle/46704732964354703864

id	ndltd-TW-097TTU05392029
record_format	oai_dc
spelling	ndltd-TW-097TTU05392029 2016-05-02T04:11:10Z http://ndltd.ncl.edu.tw/handle/46704732964354703864 A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition 聽視覺特徵擷取在中文數字語音辨識之研究 Wen-Yuan Liao 廖文淵 博士 大同大學 資訊工程學系(所) 97 Tsang-Long Pao 包蒼龍 2009 學位論文 ; thesis 106 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Doctoral === Tatung University === Department of Computer Science and Engineering === 97 === In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of these audio-visual speech recognizers is to improve recognition accuracy, particularly under difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves a better recognition rate in noisy conditions and on emotional speech. We first extract visual features of the lips, including geometric and motion features. These features are especially important to the recognition system in noisy conditions or when the speech is affected by emotion. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor. We compare the system's performance when using motion features and when using geometric features. For the recognition system, we propose the weighted-discrete KNN (WD-KNN) as the classifier, compare it with two popular classifiers, the GMM and the HMM, and evaluate all three on a Mandarin audio-visual speech corpus. We find that the WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic property of Mandarin and because it is computationally inexpensive. Experimental results for the different classifiers at various SNR levels are presented. The results show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on this Mandarin speech corpus. Several weighting functions were also studied for the weighted KNN-based classifier, such as linear distance weighting, inverse distance weighting, rank weighting, and reverse Fibonacci weighting. Overall, the results show that the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the three extended versions of KNN. Finally, we perform emotional speech recognition experiments. The results show that the system is more robust when visual information is included: the audio-visual speech recognition system attains a higher recognition rate when visual cues are incorporated.
author2	Tsang-Long Pao
author_facet	Tsang-Long Pao Wen-Yuan Liao 廖文淵
author	Wen-Yuan Liao 廖文淵
spellingShingle	Wen-Yuan Liao 廖文淵 A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition
author_sort	Wen-Yuan Liao
title	A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition
title_sort	study on audio-visual feature extraction for mandarin digit speech recognition
publishDate	2009
url	http://ndltd.ncl.edu.tw/handle/46704732964354703864
work_keys_str_mv	AT wenyuanliao astudyonaudiovisualfeatureextractionformandarindigitspeechrecognition AT liàowényuān astudyonaudiovisualfeatureextractionformandarindigitspeechrecognition AT wenyuanliao tīngshìjuétèzhēngxiéqǔzàizhōngwénshùzìyǔyīnbiànshízhīyánjiū AT liàowényuān tīngshìjuétèzhēngxiéqǔzàizhōngwénshùzìyǔyīnbiànshízhīyánjiū AT wenyuanliao studyonaudiovisualfeatureextractionformandarindigitspeechrecognition AT liàowényuān studyonaudiovisualfeatureextractionformandarindigitspeechrecognition
_version_	1718253431662051328

Similar Items