Summary: | Ph.D. === National Tsing Hua University === Department of Electrical Engineering === 85 === When a speech recognition system is operated over
telephone networks, the acoustic mismatch between training and
testing environments always degrades recognition performance.
The mismatch in telephone environments is attributed to
the ambient noise, the channel effect and the variation among
speakers. This dissertation describes a number of robust
algorithms that improve recognition performance by compensating
for these three mismatch factors. In experiments on hidden Markov
model (HMM) based speech recognition, the proposed methods
successfully overcome the mismatch problems in
telephone environments. The effect of noise on the speech cepstral
vector and its associated HMM acoustic parameters is investigated
first. Because the cepstral vector shrinks in noisy environments,
the projection-based likelihood measure, which uses an optimal
equalization factor to adapt the cepstral mean vector of the HMM
parameters, is robust to noise contamination. This
dissertation extends this measure by further compensating for the
shrinkage of the covariance matrix and the bias of the mean vector. The
compensation factors are obtained from a set of adaptation
functions. Using this method, the recognition accuracy can be
remarkably improved. To overcome the channel effect in
telephone speech, a channel-effect-cancellation method is
developed. This approach estimates a channel-effect-cancellation
filter as a convex combination of several reference filters. The
reference filters, represented in the cepstral domain, are
generated by clustering the cepstra of inverse telephone
channels. The convex combination coefficients are calculated from
the accumulated observation probabilities obtained when the
test utterance passes through the reference filters. Using
this method, the channel effect can be mostly canceled. Next,
this dissertation presents two transformation-based approaches
that adapt the HMM parameters so that they are acoustically
close to the telephone environment. Bias and affine
transformations are examined. We apply maximum a posteriori (MAP)
estimation, which incorporates prior knowledge into the
transformation, to estimate the transformation parameters. In
our evaluation, transformation-based adaptation using MAP
estimation outperforms that using maximum likelihood (ML)
estimation. The affine transformation is also demonstrated to be
superior to the bias transformation. Furthermore, a phone-
dependent channel compensation (PDCC) technique is proposed for
adapting the HMM parameters to a new channel environment by
using some adaptation data. The adaptation of HMM parameters is
completed by incorporating the corresponding PDCC vectors. To
improve the performance, two extended PDCC techniques are
presented. One is based on the refinement of PDCC using vector
quantization. The other is based on the interpolation of
compensation vectors. This method is shown to be effective in
telephone speech recognition as well as in speaker
adaptation. In addition, we propose a hybrid algorithm
for adapting the HMM parameters to a new speaker. This algorithm
is constructed by iteratively and alternately combining three
adaptation techniques. First, the clusters of HMM parameters are
locally transformed through a group of transformation functions.
Then, the transformed HMM parameters are globally smoothed via
the MAP adaptation. Within the MAP adaptation, the parameters of
unseen units in adaptation data are further adapted by applying
the transfer vector interpolation scheme. Using this algorithm,
the advantages of these three adaptation techniques can be
simultaneously captured. The resulting performance is
consistently better than that of other methods for almost any
practical amount of adaptation data.
|
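The channel-effect-cancellation step described in the summary can be sketched as follows. This is a minimal illustration in Python with NumPy under stated assumptions, not the dissertation's implementation: a single diagonal Gaussian stands in for the HMM, a filter is applied as an additive offset in the cepstral domain, and the convex combination coefficients are taken to be the accumulated likelihoods normalized to sum to one. The names `accumulated_loglik`, `convex_weights`, and `cancel_channel` are hypothetical.

```python
import numpy as np


def accumulated_loglik(frames, mean, var):
    """Sum of per-frame log-likelihoods under one diagonal Gaussian.

    Assumption: this Gaussian is a stand-in for the HMM observation model.
    """
    diff = frames - mean
    per_frame = -0.5 * (
        np.sum(diff ** 2 / var, axis=1) + np.sum(np.log(2.0 * np.pi * var))
    )
    return float(np.sum(per_frame))


def convex_weights(frames, ref_filters, mean, var):
    """Non-negative weights summing to one, one per reference filter.

    Each reference filter is a cepstral vector; filtering is modeled as
    adding that vector to every frame (assumption). The weights are the
    accumulated likelihoods normalized over the filters, computed in the
    log domain for numerical stability.
    """
    scores = np.array(
        [accumulated_loglik(frames + r, mean, var) for r in ref_filters]
    )
    scores -= scores.max()  # shift so the largest exponent is zero
    w = np.exp(scores)
    return w / w.sum()


def cancel_channel(frames, ref_filters, mean, var):
    """Apply the convex combination of reference filters to the frames."""
    w = convex_weights(frames, ref_filters, mean, var)
    combined = w @ ref_filters  # weighted sum of reference cepstra
    return frames + combined, w
```

Here `frames` is a `(T, D)` array of cepstral vectors for the test utterance and `ref_filters` is a `(K, D)` array of reference filters. Because the weights are non-negative and sum to one, the estimated filter always lies inside the convex hull of the reference filters.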