A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition

PhD === Tatung University === Department of Computer Science and Engineering === 97 === In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of all such audio-visual speech recognizers is to improve recognition accuracy, particularly in difficult...


Bibliographic Details
Main Authors: Wen-Yuan Liao, 廖文淵
Other Authors: Tsang-Long Pao
Format: Others
Language: en_US
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/46704732964354703864
id ndltd-TW-097TTU05392029
record_format oai_dc
spelling ndltd-TW-097TTU05392029 2016-05-02T04:11:10Z http://ndltd.ncl.edu.tw/handle/46704732964354703864 A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition 聽視覺特徵擷取在中文數字語音辨識之研究 Wen-Yuan Liao 廖文淵 PhD === Tatung University === Department of Computer Science and Engineering === 97 === (abstract: see description field below) Tsang-Long Pao 包蒼龍 2009 thesis, 106 pages, en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description PhD === Tatung University === Department of Computer Science and Engineering === 97 === In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of all such audio-visual speech recognizers is to improve recognition accuracy, particularly in difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves a better recognition rate in noisy conditions as well as on speech spoken under emotional conditions. We first extract visual features of the lips, including geometric and motion features. These features are very important to the recognition system, especially in noisy conditions or in the presence of emotional effects. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor. We compare the performance of the system when it uses motion features and when it uses geometric features. For this recognition system, we propose the weighted-discrete KNN (WD-KNN) as the classifier, compare it with two popular classifiers, the GMM and the HMM, and evaluate their performance on a Mandarin audio-visual speech corpus. We find that the WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic property of Mandarin and because it is computationally inexpensive. Experimental results for the different classifiers at various SNR levels are presented. The results show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions were also studied for the weighted KNN-based classifier, including linear distance weighting, inverse distance weighting, rank weighting, and a reverse Fibonacci weighting function. The overall results show that, among the three extended versions of KNN, the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate. Finally, we perform emotional speech recognition experiments. The results show that the system is more robust when visual information is included: the audio-visual speech recognition system achieves a higher recognition rate when visual cues are incorporated.
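The weighted-discrete KNN with rank-based weights described in the abstract can be pictured with a minimal sketch. The Python fragment below is only an assumed, simplified rendering of the general idea (reverse Fibonacci weights assigned by neighbor rank, followed by weighted class voting); the function names, the choice of Euclidean distance, the value of k, and the toy data are illustrative assumptions, not details taken from the thesis.

# Illustrative sketch only: a rank-weighted KNN ("WD-KNN"-style) classifier
# using a reverse Fibonacci weighting function. Distance metric, names and
# data are assumptions made for this example.
import numpy as np

def reverse_fibonacci_weights(k):
    # The nearest neighbor receives the largest Fibonacci number,
    # the farthest of the k neighbors receives the smallest.
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return np.array(fib[:k][::-1], dtype=float)   # k=5 -> [5, 3, 2, 1, 1]

def wd_knn_predict(train_feats, train_labels, query, k=5):
    # Weighted voting among the k nearest training samples
    # (Euclidean distance assumed here).
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    weights = reverse_fibonacci_weights(k)        # weight by rank, not by distance
    scores = {}
    for idx, w in zip(nearest, weights):
        label = int(train_labels[idx])
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)            # class with the largest weighted vote

# Toy usage: ten 2-D feature vectors from two classes.
X = np.array([[0.10, 0.20], [0.20, 0.10], [0.15, 0.25], [0.90, 0.80], [0.85, 0.90],
              [0.95, 0.85], [0.05, 0.15], [0.80, 0.95], [0.12, 0.22], [0.88, 0.82]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1, 0, 1])
print(wd_knn_predict(X, y, np.array([0.10, 0.18])))  # expected output: 0

Swapping reverse_fibonacci_weights for a linear, inverse-distance, or rank weighting function would reproduce, in spirit, the other weighting schemes compared in the thesis.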
author2 Tsang-Long Pao
author Wen-Yuan Liao
廖文淵
author_sort Wen-Yuan Liao
title A Study on Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition
publishDate 2009
url http://ndltd.ncl.edu.tw/handle/46704732964354703864