Robust Features and Efficient Models for Speaker Identification

Bibliographic Details
Main Author: Kuo-Hwei Yuo (游國輝)
Other Authors: Hsiao-Chuan Wang
Format: Others
Language: zh-TW
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/58137971312172176468
Description
Summary: Ph.D. === National Tsing Hua University === Department of Electrical Engineering === 88 === The objective of this dissertation is to find robust features and efficient models that improve speaker recognition performance. Two types of robust features are presented: one is robust to additive noise, and the other is robust to the coexistence of additive and convolutional noise. In addition, we present two statistical models that depict a speaker's feature space more efficiently than the classical Gaussian mixture model (GMM) with diagonal covariance matrices.

The first robust feature is based on filtering the temporal trajectories of the short-time one-sided autocorrelation sequences of speech to remove additive noise. The filtered sequences are called relative autocorrelation sequences (RAS), and mel-scale frequency cepstral coefficients (MFCC) are extracted from the RAS instead of from the original speech. This new feature set is denoted RAS-MFCC. The second robust feature involves two stages of temporal trajectory filtering: the first is applied in the autocorrelation domain to remove additive noise, and the second is applied in the logarithmic spectrum domain to remove convolutional noise. The filtered sequence is called the CHAnnel-normalized Relative Autocorrelation Sequence (CHARAS), and the MFCCs extracted from it are called CHARAS-MFCC. The RAS-MFCC is a special case of the CHARAS-MFCC. We conduct experiments under a variety of noisy environments that include additive and convolutional noise. The RAS-MFCC and CHARAS-MFCC are shown to be superior to the projection method, and combining them with the projection measure further improves identification accuracy.

Next, we present a new GMM structure that depicts a speaker's feature space more efficiently than the traditional GMM structure. The idea is to embed a common decorrelating transformation matrix in all Gaussian pdfs. This approach resembles a classical one derived from the Karhunen-Loève transformation, but the algorithms for deriving the transformation matrix are inherently different. The proposed model is called the transformation-embedded GMM (TE-GMM); its transformation matrix, together with the other model parameters, can be trained simultaneously by maximum likelihood estimation. We then generalize the single transformation used in the TE-GMM to multiple transformations, yielding a new model called the General Covariance GMM (GC-GMM). The GMM with diagonal covariance matrices is denoted DC-GMM (Diagonal Covariance GMM), and the GMM with full covariance matrices is denoted FC-GMM (Full Covariance GMM); both are special cases of the GC-GMM. The experimental results show that the TE-GMM achieves better accuracy than the classical Karhunen-Loève transformation method. They also show that, in comparison with the traditional GMM, the GC-GMM significantly reduces the computational complexity and the number of parameters without degrading system performance.
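To make the RAS-MFCC front end concrete, here is a minimal NumPy sketch. The record does not specify the exact trajectory filter used in the dissertation, so a first-order difference along the time axis stands in for it; the frame length, lag count, sampling rate, and filterbank sizes are likewise illustrative assumptions, not values from the thesis.

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames (illustrative sizes)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def one_sided_autocorr(frames, n_lags=128):
    """Short-time one-sided autocorrelation, one row per frame."""
    acf = np.empty((len(frames), n_lags))
    for i, f in enumerate(frames):
        full = np.correlate(f, f, mode="full")
        acf[i] = full[len(f) - 1:len(f) - 1 + n_lags]
    return acf

def ras(acf):
    """Temporal trajectory filtering: stationary additive noise adds a
    nearly constant offset to each lag's trajectory, so high-pass
    filtering along time suppresses it.  A first-order difference is
    used here as a stand-in for the dissertation's filter."""
    return np.diff(acf, axis=0)

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank over the rFFT bins."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def ras_mfcc(x, sr=8000, n_fft=256, n_mels=20, n_ceps=12):
    """MFCCs computed from the filtered autocorrelation rather than
    from the waveform itself (the RAS-MFCC idea)."""
    filtered = ras(one_sided_autocorr(frame_signal(x)))
    # The FFT of an autocorrelation sequence is a power-spectrum
    # estimate; the magnitude guards against negative values that the
    # trajectory filtering can introduce.
    spec = np.abs(np.fft.rfft(filtered, n=n_fft, axis=1))
    logmel = np.log(spec @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # DCT-II of the log-mel energies yields the cepstral coefficients.
    k = np.arange(n_ceps)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n_mels) + 1) / (2 * n_mels))
    return logmel @ basis.T
```

For CHARAS-MFCC, a second trajectory filter would be applied to the log-spectral trajectories before the DCT to remove the convolutional channel; for instance, subtracting each log-mel trajectory's temporal mean, in the spirit of cepstral mean normalization, would play that role, though the dissertation's exact second-stage filter may differ.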
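The TE-GMM can be illustrated similarly. The dissertation trains the shared transformation jointly with the other parameters by maximum likelihood; the sketch below shows only how the log-likelihood of such a model would be evaluated, which is where the efficiency argument lives: one shared D-by-D matrix plus diagonal covariances, instead of a full covariance matrix per component. All names and shapes here are illustrative assumptions.

```python
import numpy as np

def te_gmm_loglik(X, A, weights, means, variances):
    """Log-likelihood of X under a GMM whose diagonal Gaussians live in
    the shared transformed space y = A x (the TE-GMM idea: one common
    decorrelating matrix embedded in all components).

    X: (N, D) feature frames; A: (D, D) shared transform;
    weights: (K,); means, variances: (K, D), diagonal in y-space.
    """
    N, D = X.shape
    Y = X @ A.T  # transform every frame once, shared by all components
    # Change of variables: p(x) = |det A| * N(Ax; mu_k, diag(var_k)).
    logdet = np.linalg.slogdet(A)[1]
    log_comp = np.empty((N, len(weights)))
    for k, (w, mu, var) in enumerate(zip(weights, means, variances)):
        log_comp[:, k] = (np.log(w) + logdet
                          - 0.5 * (D * np.log(2.0 * np.pi)
                                   + np.sum(np.log(var))
                                   + np.sum((Y - mu) ** 2 / var, axis=1)))
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.sum(np.exp(log_comp - m), axis=1,
                              keepdims=True))).ravel()
```

Counting parameters shows the appeal: a K-component diagonal GMM needs roughly K(2D + 1) parameters, a full-covariance GMM roughly K(D + D(D + 1)/2), while this shared-transform model needs only K(2D + 1) + D². The GC-GMM generalizes the picture by letting groups of components share one of several transforms, so the DC-GMM (identity transform) and the FC-GMM (one transform per component) fall out as the two extremes, as the abstract states.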