Summary: | Doctoral === National Cheng Kung University === Department of Electrical Engineering === 107 === This study integrates speech source localization with speech recognition and proposes an automatic speech recognition (ASR) system that additionally reports the angle from which speech arrives. By analyzing the received speech signals, the proposed ASR system estimates the direction of the speech and displays the recognition results in low signal-to-noise ratio (SNR) environments. The system comprises two stages: speech source localization and speech recognition. The proposed methods improve both the accuracy of angular estimation and the speech recognition rate in noisy environments.
In the speech source localization stage, this study proposes a preprocessing scheme that reduces the error of direction of arrival (DOA) estimation, based on an investigation of the average magnitude difference function (AMDF), minimum variance distortionless response (MVDR), and multiple signal classification (MUSIC) methods. The preprocessing method uses linear phase approximation to predict the ideal phase line that would be observed in the absence of noise, and then reconstructs the covariance matrix of the received speech signal. To further increase the accuracy of the DOA result, a second method based on eigenvalue decomposition (EVD) detects and filters out the noisy frequency bins of the speech signal.
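The linear phase idea can be illustrated with a minimal sketch. It assumes a two-microphone, far-field setup in which a pure delay τ produces an inter-microphone phase difference of −2πfτ, fits a line through the origin to the measured phases, and converts the slope to an angle. The function names, the 5 cm microphone spacing, and the speed-of-sound value are assumptions made for illustration; they are not taken from the study, whose preprocessing also reconstructs the covariance matrix before applying MVDR or MUSIC.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def fit_phase_slope(freqs_hz, phase_rad):
    """Least-squares slope of a line through the origin: phase ~ slope * f."""
    f = np.asarray(freqs_hz, dtype=float)
    p = np.asarray(phase_rad, dtype=float)
    return float(np.dot(f, p) / np.dot(f, f))


def doa_from_phase(freqs_hz, phase_rad, mic_spacing_m):
    """Estimate the arrival angle in degrees from unwrapped phase differences."""
    slope = fit_phase_slope(freqs_hz, phase_rad)
    delay_s = -slope / (2.0 * np.pi)  # because phase = -2*pi*f*delay
    sin_theta = np.clip(SPEED_OF_SOUND * delay_s / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


# Hypothetical example: 5 cm spacing, source at 30 degrees, mild phase noise.
rng = np.random.default_rng(0)
freqs = np.linspace(200.0, 3000.0, 57)
true_delay = 0.05 * np.sin(np.radians(30.0)) / SPEED_OF_SOUND
noisy_phase = -2.0 * np.pi * freqs * true_delay + rng.normal(0.0, 0.05, freqs.size)
estimated_angle = doa_from_phase(freqs, noisy_phase, mic_spacing_m=0.05)
```

The frequency range is kept low enough that the phase difference stays within ±π, so no phase unwrapping is needed in this sketch; a wider band or larger spacing would require it.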
In the speech recognition stage, this study presents a threshold-based noise detection method. The method automatically calculates and records the current SNR of the speech signal, and the ASR system uses that value to decide when to enhance the collected speech. This avoids over-filtering the speech, which can decrease the speech recognition rate. In the noise reduction stage, independent component analysis (ICA) and subspace speech enhancement (SSE) are employed to remove noise from the received speech and to enhance the magnitude of the speech for recognition. The speech recognizer is built on the Hidden Markov Model Toolkit (HTK), which was developed at the Machine Intelligence Laboratory of the Cambridge University Engineering Department. The HTK-based speech recognizer analyzes the enhanced speech signal and outputs the recognition result.
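The threshold-based decision described above can be sketched as follows. This is a minimal illustration, assuming that a noise-only segment is available for estimating noise power and that a 10 dB threshold is used; the function names, the threshold value, and the `enhance` callback (standing in for ICA or SSE) are hypothetical and are not taken from the study.

```python
import numpy as np


def estimate_snr_db(noisy, noise_only, eps=1e-12):
    """Estimate SNR in dB, using a noise-only segment for the noise power."""
    noisy = np.asarray(noisy, dtype=float)
    noise_only = np.asarray(noise_only, dtype=float)
    noise_power = max(float(np.mean(noise_only ** 2)), eps)
    # Speech power is approximated as total power minus noise power.
    signal_power = max(float(np.mean(noisy ** 2)) - noise_power, eps)
    return 10.0 * np.log10(signal_power / noise_power)


def enhance_if_needed(noisy, noise_only, enhance, threshold_db=10.0):
    """Apply `enhance` only when the estimated SNR is below the threshold."""
    snr_db = estimate_snr_db(noisy, noise_only)
    if snr_db < threshold_db:
        return enhance(noisy), snr_db, True
    # SNR is already high enough: skip enhancement to avoid over-filtering.
    return np.asarray(noisy, dtype=float), snr_db, False
```

Returning the SNR and a flag alongside the signal lets a caller record the value and see whether enhancement was applied, mirroring the calculate-and-record behavior the abstract describes.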
The experiments indicate that the proposed ASR system can effectively estimate the direction of a speech signal and recognize its content in noisy environments. With the proposed preprocessing scheme, the mean estimation error is reduced by about 4.98° relative to the conventional MVDR method and by around 7.61° relative to the conventional MUSIC method. For noise reduction, the SNR of the enhanced speech exceeds that of the noise-contaminated speech by approximately 10 dB to 15 dB, and the speech recognition rate improves by around 12% to 17% after the proposed noise detection and reduction methods are applied. The study also investigates artificial intelligence (AI) techniques, specifically deep learning. Both a deep neural network (DNN) and a convolutional-recurrent neural network (CRNN) are introduced for the noise reduction process. In the experiments, the perceptual evaluation of speech quality (PESQ) score improves by 0.83 compared with the noisy speech, and the word error rate decreases to 15.83%. The experimental results demonstrate that the CRNN model can effectively suppress the influence of noise and improve the quality of speech for the recognition process in the ASR system.
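For readers unfamiliar with the word error rate reported above, the following minimal sketch computes it under the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the abstract does not state how the study scored its transcripts.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.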
|