Summary: Ph.D. dissertation | National Taiwan University | Graduate Institute of Computer Science and Information Engineering | 78

Machine dictation of Mandarin speech is a long-term research goal of many
graduate students in the speech laboratory of National Taiwan University.
In this dissertation, the problem of a Mandarin dictation machine has
been carefully considered and formulated based on the general research
methodology developed in this laboratory, in which statistical approaches
are used and the problem is divided into several relevant subproblems,
including base syllable recognition (disregarding the tones), full
Mandarin tone recognition and Chinese language modeling. These
subproblems have been carefully studied and some notable results have
been obtained.
For the base syllable recognition, a comparative study on the performance
of several speech recognition techniques was conducted first, including
dynamic time warping (DTW), the newly proposed DTW with superimposed
weighting function (DTWW), discrete hidden Markov models (DHMM) and
continuous hidden Markov models (CHMM). After a series of
experiments, it was found that the recognition rate of the newly proposed
DTWW (88.3%) is higher than that of DTW (85.1%), DHMM (65.0%) and
CHMM (83.9%), and that the CPU time used for DTWW is 1.03 times that for
DTW, 24 times that for DHMM and 4.3 times that for CHMM. In addition, the
memory space required for DTWW and DTW is 3.4 times that of DHMM and 8.5
times that of CHMM. Therefore, DTWW has the highest recognition rate and
DHMM has the fastest recognition speed, whereas CHMM appears to be very
attractive when all the different factors, including the recognition rate,
recognition speed and memory space requirement, are considered.
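To make the template-matching techniques compared above concrete, the following is a minimal sketch of the dynamic-programming recursion behind DTW, with an optional frame-position weighting term standing in for the superimposed weighting function of DTWW; the local path constraint and the weighting interface are illustrative assumptions, not the exact formulation used in the dissertation.

```python
import numpy as np

def dtw_distance(ref, test, weight=None):
    """Accumulated distance between two sequences of acoustic feature
    vectors (each of shape (T, D)) along the optimal time-warping path.

    weight, if given, is a function weight(i, n_ref) scaling the local
    frame distance; a superimposed weighting function of this kind is
    the idea behind the DTWW variant (the actual function used in the
    dissertation is not reproduced here)."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated-distance table
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])
            if weight is not None:
                d *= weight(i - 1, n)
            # Local path constraint: diagonal match, insertion, deletion.
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

In a template-matching recognizer of this kind, each unknown syllable is compared against all reference templates and the template with the smallest accumulated distance is chosen.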
Following the above comparative study, the speech recognition technique of
HMM's has been further investigated, and Hidden Markov Models with Bounded
State Durations (HMM/BSD) are proposed to explicitly model the state
durations of HMM's and more accurately consider the temporal structures
existing in speech signals for better recognition of the highly confusing
Mandarin base syllables. In addition, a new computation algorithm to be
used in the recognition phase of HMM's is also proposed for all approaches
which attempt to explicitly model the state durations of HMM's. As
compared to the previously proposed approaches of using Poisson, gamma or
other distributions to model state durations, HMM/BSD is simpler, more
direct and more effective according to the experimental results. It was found
that in the discrete case the recognition rate of HMM/BSD (78.5%) is 9.0%, 6.3%
and 1.9% higher than those of the conventional HMM's and HMM's with Poisson and
gamma distributed state durations, respectively. In the continuous case
(partitioned Gaussian mixture modeling) the recognition rates of HMM/BSD
(88.3% with 1 mixture, 88.8% with 3 mixtures and 89.4% with 5 mixtures) are
6.3%, 5.0% and 5.5% higher than those of the conventional HMM's, and
5.9% (with 1 mixture) and 3.9% (with 3 mixtures) higher than HMM's with
Poisson distributed state durations, and 3.1% (with 1 mixture) and
1.8% (with 3 mixtures) higher than HMM's with gamma distributed state
durations. Furthermore, the recognition rate of HMM/BSD can be increased
to 91.3% if the dynamic cepstrum features are included.
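As a rough illustration of the bounded-state-duration idea, the sketch below shows a Viterbi-style recognition pass for a left-to-right HMM in which state j may only be occupied for between lower[j] and upper[j] consecutive frames (see l* and du(*) in the notation list); the array layout, the prefix-sum trick and the variable names are assumptions for illustration, not the new computation algorithm proposed in the dissertation.

```python
import numpy as np

def viterbi_bounded_durations(log_b, log_a, lower, upper):
    """Viterbi decoding for a left-to-right HMM with bounded state durations.

    log_b : (T, N) array, log observation probability of frame t in state j.
    log_a : (N, N) array, log state transition probabilities.
    lower, upper : length-N integer lists of per-state duration bounds.
    Returns the best log score of ending in the last state at the last frame.
    """
    T, N = log_b.shape
    NEG = -np.inf
    # delta[t, j]: best log score of a path whose visit to state j ends exactly at frame t.
    delta = np.full((T, N), NEG)
    # Prefix sums of log_b so the emission score of any frame segment is a subtraction.
    cum = np.vstack([np.zeros((1, N)), np.cumsum(log_b, axis=0)])

    for t in range(T):
        for j in range(N):
            for d in range(lower[j], upper[j] + 1):   # duration spent in state j
                start = t - d + 1
                if start < 0:
                    break
                seg = cum[t + 1, j] - cum[start, j]    # emit frames start..t in state j
                if start == 0:
                    if j == 0:                         # path must begin in the first state
                        delta[t, j] = max(delta[t, j], seg)
                else:
                    # Self-loops are excluded: durations are modeled explicitly.
                    prev = max((delta[start - 1, i] + log_a[i, j]
                                for i in range(N) if i != j), default=NEG)
                    delta[t, j] = max(delta[t, j], prev + seg)
    return delta[T - 1, N - 1]
```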
As to Mandarin tone recognition, the neutral tone is the most difficult to
distinguish, and previous works all concentrated on the recognition of the
other four tones while the neutral tone was temporarily ignored.
Therefore, a study on full Mandarin tone recognition (i.e. including
the neutral tone) for isolated syllables is conducted. The study includes
experiments for both four-tone (excluding the neutral tone) and
five-tone (including the neutral tone) recognition for comparison, and
both the speaker dependent and speaker adaptive modes are examined.
Various versions of HMM's, including discrete HMM's, continuous HMM's and
the modified version of HMM's with bounded state durations, are considered.
It was found that the performance of the previously proposed methods is
significantly degraded when the neutral tone is included, and that with the
neutral tone a new form of feature vectors proposed here, together with
bounded state durations, provides much better recognition rates in the
speaker dependent (96.1%) and speaker adaptive (90.5%) modes.
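The notation list below defines a speaker pitch reference base F, per-frame pitch ft and per-frame energy ent; the following is only a plausible sketch of how such quantities could be assembled into per-frame tone features for an HMM-based tone recognizer, not the new form of feature vectors actually proposed in the dissertation.

```python
import numpy as np

def tone_feature_vectors(f_t, en_t, F_ref):
    """Per-frame tone features for the voiced part of a syllable (illustrative).

    f_t   : pitch frequency of each analysis frame in the voiced part (Hz).
    en_t  : short-time energy of each analysis frame in the voiced part.
    F_ref : pitch frequency reference base for the speaker.
    """
    f_t = np.asarray(f_t, dtype=float)
    en_t = np.asarray(en_t, dtype=float)
    norm_pitch = np.log(f_t / F_ref)      # pitch relative to the speaker's reference base
    norm_energy = en_t / en_t.max()       # energy relative to the syllable maximum
    return np.column_stack([norm_pitch, norm_energy])
```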
For the phonetic sequence obtained through base syllable recognition and
tone recognition, the problem faced by the linguistic decoder of a
Mandarin dictation machine is that the high degree of ambiguity caused
by homonyms should be resolved, and the errors made in acoustic speech
recognition should be corrected by the decoding process. In this
dissertation this problem is formally formulated, in which the phonetic
input can be either deterministic or probabilistic; Markov models for
the Chinese language are developed, and a decoding procedure based on
dynamic programming and suitable for parallel processing is further
developed to reduce the computation time. Extensive experiments were performed
and the results show that the model can not only effectively decode the
Mandarin phonetic input sequences, but also successfully correct the speech
recognition errors and significantly improve the final recognition rates
in a Mandarin dictation machine. It is also suitable for real-time
applications.
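To make the decoding step concrete, the sketch below shows one plausible form of such a decoder: each recognized syllable is expanded into its homonym character candidates (the word/character lattice L(*, Dy) of the notation list), and a dynamic-programming search picks the character sequence with the best combined acoustic and Markov language-model score. The bigram order, the back-off scheme and all names are illustrative assumptions, not the model developed in the dissertation.

```python
def decode_lattice(candidates, bigram, unigram, floor=-20.0):
    """Viterbi-style decoding over a homonym (character) lattice.

    candidates : list over syllable positions; each entry is a list of
                 (character, acoustic_log_prob) pairs for that syllable.
    bigram     : dict (prev_char, char) -> log P(char | prev_char).
    unigram    : dict char -> log P(char); also used, with an extra
                 penalty, as a crude back-off when a bigram is unseen.
    floor      : assumed log-probability floor for unseen events.
    Returns the character sequence with the best combined score.
    """
    # One (score, path) hypothesis per candidate character of the current position.
    hyps = [(ac + unigram.get(ch, floor), [ch]) for ch, ac in candidates[0]]
    for pos in range(1, len(candidates)):
        new_hyps = []
        for ch, ac in candidates[pos]:
            # Choose the best predecessor under the Markov language model.
            best_score, best_path = max(
                ((s + bigram.get((p[-1], ch), unigram.get(ch, floor) + floor) + ac, p)
                 for s, p in hyps),
                key=lambda t: t[0])
            new_hyps.append((best_score, best_path + [ch]))
        hyps = new_hyps
    return max(hyps, key=lambda h: h[0])[1]
```

A probabilistic phonetic input can be handled in this scheme by letting each position's candidate list carry the acoustic scores of the several best recognized syllables rather than a single one.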
In addition to independently studying these subproblems, some integrated
simulations of the entire Mandarin dictation machine have also been
conducted to see the integrated function and overall performance of the
various techniques developed here. A prototype system cooperatively
implemented by many students in the speech laboratory is also briefly
described. This system serves as an example to show the practical
applicability of the techniques described in this dissertation.
Notation:
A Speech signal of a Chinese sentence.
A* Speech signal of Wi.
a** State transition probability from state * to *.
B(*) Upper constraint function for DTW.
b** b* (κ) Observation production probability for state * to produce
event κ.
b (*) Lower constraint function for DTW.
CHMM Continuous hidden Markov model.
** Category of W*.
DTW Dynamic time warping.
DTWW Dynamic time warping with superimposed weighting function.
DHMM Discrete hidden Markov model.
Dy Dictionary of Chinese characters or words.
d(***) Distance function.
d Order of Markov modeling.
du Duration staying at a state.
du(*) Number of analysis frames expended in state *.
E Event set.
en Maximum short-time energy of the voice part in a syllable.
ent Short-time energy of the t-th analysis frame in the voiced
part.
*** means code * is the ***-th nearest neighbor of code *.
F Pitch frequency reference base for a speaker.
ft Pitch frequency of the t-th analysis frame in the voiced
part.
Fs* Feature parameters measured from A* that are relevant to
base syllable recognition.
Ft* Feature parameters measured from A* that are relevant to
lexical tone recognition.
g* Gain of the *-th mixture.
HMM Hidden Markov model.
HMMg Hidden Markov modeling.
K Total number of training utterances.
L(*,Dy) Word (or character) lattice formed by * and Dy.
l* Lower state duration bound for state *.
M Number of mixtures in Gaussian mixture modeling.
Mkt Number of elements in Wkt.
MDI Minimum discrimination information.
ML Maximum likelihood.
MM Markov model.
MMg Markov modeling.
MMI Maximum mutual information.