A study on a few relevant problems about machine dictation of Mandarin speech


Bibliographic Details
Main Authors: GU, HONG-YAN, 古鴻炎
Other Authors: LI, LIN-SHAN
Format: Others
Language: zh-TW
Published: 1990
Online Access: http://ndltd.ncl.edu.tw/handle/23523326772467276115
Description
Summary: Ph.D. === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 78 === Machine dictation of Mandarin speech is a long-term research goal of many graduate students in the speech laboratory of National Taiwan University. In this dissertation, the problem of a Mandarin dictation machine is formulated following the general research methodology developed in this laboratory: statistical approaches are used, and the problem is divided into several subproblems, namely base syllable recognition (disregarding the tones), full Mandarin tone recognition, and Chinese language modeling. Each of these subproblems is studied and some notable results are obtained.
For base syllable recognition, a comparative study of several speech recognition techniques was conducted first: dynamic time warping (DTW), the newly proposed DTW with superimposed weighting function (DTWW), discrete hidden Markov models (DHMM), and continuous hidden Markov models (CHMM). A series of experiments showed that the recognition rate of the newly proposed DTWW (88.3%) is higher than that of DTW (85.1%), DHMM (65.0%) and CHMM (83.9%), and that the CPU time used by DTWW is 1.03 times that of DTW, 24 times that of DHMM and 4.3 times that of CHMM. In addition, the memory space required by DTWW and DTW is 3.4 times that of DHMM and 8.5 times that of CHMM. Therefore, DTWW has the highest recognition rate and DHMM has the fastest recognition speed, whereas CHMM appears very attractive when recognition rate, recognition speed and memory requirement are all considered.
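As an illustration of the template-matching idea behind DTW, the following is a minimal sketch of classic dynamic time warping with a Euclidean frame distance and the basic three-way step pattern. It is not the dissertation's implementation: the function names `dtw_distance` and `classify` are hypothetical, and the superimposed weighting function of DTWW and the DTW constraint functions are not specified in this summary, so they are not reproduced here.

```python
import math


def dtw_distance(ref, test):
    """Accumulated DTW distance between two sequences of feature vectors.

    Assumption: plain DTW with steps (i-1, j), (i, j-1), (i-1, j-1) and no
    global path constraints; the DTWW weighting is NOT modeled.
    """
    def frame_dist(a, b):
        # Euclidean distance between two feature vectors of equal length.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n, m = len(ref), len(test)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning ref[:i] with test[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def classify(templates, test):
    """Return the label of the reference template closest to `test` under DTW."""
    return min(templates, key=lambda label: dtw_distance(templates[label], test))
```

Because every test utterance is compared frame by frame against every stored template, this kind of matcher tends to cost more time and memory than a compact statistical model, which is in line with the trade-off reported above.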
Following the above comparative study, HMM-based recognition was investigated further, and hidden Markov models with bounded state durations (HMM/BSD) are proposed to model the state durations of HMMs explicitly and to account more accurately for the temporal structure of speech signals, for better recognition of the highly confusable Mandarin base syllables. In addition, a new computation algorithm for the recognition phase is proposed that applies to all approaches that explicitly model HMM state durations. Compared with previously proposed approaches that model state durations with Poisson, gamma or other distributions, HMM/BSD is simpler, more direct and more effective according to the experimental results. In the discrete case, the recognition rate of HMM/BSD (78.5%) is 9.0%, 6.3% and 1.9% higher than that of the conventional HMMs and of HMMs with Poisson- and gamma-distributed state durations, respectively. In the continuous case (partitioned Gaussian mixture modeling), the recognition rates of HMM/BSD (88.3% with 1 mixture, 88.8% with 3 mixtures and 89.4% with 5 mixtures) are 6.3%, 5.0% and 5.5% higher than those of the conventional HMMs; they are 5.9% (1 mixture) and 3.9% (3 mixtures) higher than those of HMMs with Poisson-distributed state durations, and 3.1% (1 mixture) and 1.8% (3 mixtures) higher than those of HMMs with gamma-distributed state durations. Furthermore, the recognition rate of HMM/BSD rises to 91.3% when dynamic cepstrum features are included.
As to Mandarin tone recognition, the neutral tone is the most difficult to distinguish, and previous work concentrated on the other four tones while temporarily ignoring the neutral tone. Therefore, a study of full Mandarin tone recognition (i.e. including the neutral tone) for isolated syllables is conducted.
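To make the bounded-state-duration idea of HMM/BSD concrete, here is a minimal sketch of a duration-constrained Viterbi search. It rests on several assumptions that are not stated in the summary: a strict left-to-right model with no state skips, every duration within the bounds treated as equally likely, and emission log-probabilities supplied as a precomputed table. The name `bsd_viterbi` and its parameters are hypothetical, and this is not the new recognition algorithm proposed in the dissertation.

```python
def bsd_viterbi(log_emit, lo, hi):
    """Best segmentation of T frames into N left-to-right states.

    log_emit[t][i] is log b_i(o_t). State i must be occupied for d frames
    with lo[i] <= d <= hi[i] (assumed: lo[i] >= 1).
    Returns (best_log_score, durations), or (-inf, None) if no segmentation
    satisfies the bounds.
    """
    T, N = len(log_emit), len(lo)
    neg = float("-inf")
    # prefix[i][t]: sum of log_emit[s][i] for s < t, so that any segment
    # score is a difference of two prefix sums.
    prefix = [[0.0] * (T + 1) for _ in range(N)]
    for i in range(N):
        for t in range(T):
            prefix[i][t + 1] = prefix[i][t] + log_emit[t][i]
    # best[i][t]: best score when states 0..i-1 cover exactly the first t frames.
    best = [[neg] * (T + 1) for _ in range(N + 1)]
    back = [[0] * (T + 1) for _ in range(N + 1)]
    best[0][0] = 0.0
    for i in range(N):
        for t in range(T + 1):
            if best[i][t] == neg:
                continue
            for d in range(lo[i], hi[i] + 1):
                end = t + d
                if end > T:
                    break
                score = best[i][t] + prefix[i][end] - prefix[i][t]
                if score > best[i + 1][end]:
                    best[i + 1][end] = score
                    back[i + 1][end] = d
    if best[N][T] == neg:
        return neg, None
    # Trace back the duration chosen for each state.
    durations, t = [], T
    for i in range(N, 0, -1):
        durations.append(back[i][t])
        t -= back[i][t]
    durations.reverse()
    return best[N][T], durations
```

In this sketch, tightening the upper bound of a state forces later states to absorb the remaining frames even when they fit poorly, which is how duration bounds penalize temporally implausible alignments.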
The study includes experiments on both four-tone (excluding the neutral tone) and five-tone (including the neutral tone) recognition for comparison, and both speaker-dependent and speaker-adaptive modes are examined. Various versions of HMMs are considered, including discrete HMMs, continuous HMMs and the modified HMMs with bounded state durations. It was found that the performance of previously proposed methods degrades significantly when the neutral tone is included, and that, with the neutral tone, a new form of feature vector proposed here together with bounded state durations gives much better recognition rates in the speaker-dependent (96.1%) and speaker-adaptive (90.5%) modes.
Given the phonetic sequence obtained through base syllable recognition and tone recognition, the linguistic decoder of a Mandarin dictation machine must resolve the high degree of ambiguity caused by homonyms and correct the errors made in acoustic speech recognition. In this dissertation this problem is formulated formally, with phonetic input that can be either deterministic or probabilistic, and Markov models for the Chinese language are developed in a form suitable for parallel processing based on dynamic programming. Extensive experiments show that the model not only decodes Mandarin phonetic input sequences effectively, but also corrects speech recognition errors and significantly improves the final recognition rates of a Mandarin dictation machine. It is also suitable for real-time applications.
In addition to studying these subproblems independently, some integrated simulations of the entire Mandarin dictation machine were conducted to examine the integrated function and overall performance of the various techniques developed here.
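The linguistic decoding step described above, choosing among homonym candidates with a Markov language model and dynamic programming, can be sketched as a Viterbi search over a character lattice. This is a minimal illustration under assumptions not taken from the dissertation: a first-order (bigram) model, deterministic syllable input only, a simple probability floor for unseen pairs, and toy probabilities invented for the example. The name `decode` and its parameters are hypothetical.

```python
import math


def decode(syllables, lexicon, unigram, bigram, floor=1e-6):
    """Most likely character sequence for a syllable sequence (bigram Viterbi).

    lexicon maps each syllable to its candidate characters (homonyms);
    unigram and bigram hold probabilities; missing entries fall back to `floor`.
    """
    def lp(p):
        return math.log(max(p, floor))

    # scores[c]: best log probability of any path ending in character c.
    scores = {c: lp(unigram.get(c, 0.0)) for c in lexicon[syllables[0]]}
    backptrs = []
    for syl in syllables[1:]:
        new_scores, ptr = {}, {}
        for c in lexicon[syl]:
            # Pick the best predecessor for candidate c.
            prev, s = max(
                ((p, ps + lp(bigram.get((p, c), 0.0))) for p, ps in scores.items()),
                key=lambda item: item[1],
            )
            new_scores[c], ptr[c] = s, prev
        backptrs.append(ptr)
        scores = new_scores
    # Trace back from the best final character.
    path = [max(scores, key=scores.get)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    lexicon = {"dian4": ["電", "店"], "hua4": ["話", "畫"]}
    unigram = {"電": 0.5, "店": 0.5}
    bigram = {("電", "話"): 0.6, ("電", "畫"): 0.05,
              ("店", "話"): 0.01, ("店", "畫"): 0.02}
    print(decode(["dian4", "hua4"], lexicon, unigram, bigram))
```

Handling probabilistic phonetic input, as mentioned in the summary, would require adding an acoustic score for each syllable candidate to the path score; that extension is left out of this sketch.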
A prototype system cooperatively implemented by many students in the speech laboratory is also briefly described. This system serves as an example of the practical applicability of the techniques described in this dissertation.
Symbols and abbreviations:
A          Speech signal of a Chinese sentence.
A*         Speech signal of Wi.
a**        State transition probability from state * to *.
B(*)       Upper constraint function for DTW.
b** b*(κ)  Observation production probability for state * to produce event κ.
b(*)       Lower constraint function for DTW.
CHMM       Continuous hidden Markov model.
**         Category of W*.
DTW        Dynamic time warping.
DTWW       Dynamic time warping with superimposed weighting function.
DHMM       Discrete hidden Markov model.
Dy         Dictionary of Chinese characters or words.
d(***)     Distance function.
d          Order of Markov modeling.
du         Duration of stay in a state.
du(*)      Number of analysis frames expended in state *.
E          Event set.
en         Maximum short-time energy of the voiced part in a syllable.
ent        Short-time energy of the t-th analysis frame in the voiced part.
***        Indicates that code * is the ***-th nearest neighbor of code *.
F          Pitch frequency reference base for a speaker.
ft         Pitch frequency of the t-th analysis frame in the voiced part.
Fs*        Feature parameters measured from A* that are relevant to base syllable recognition.
Ft*        Feature parameters measured from A* that are relevant to lexical tone recognition.
g*         Gain of the *-th mixture.
HMM        Hidden Markov model.
HMMg       Hidden Markov modeling.
K          Total number of training utterances.
L(*,Dy)    Word (or character) lattice formed by * and Dy.
l*         Lower state duration bound for state *.
M          Number of mixtures in Gaussian mixture modeling.
Mkt        Number of elements in Wkt.
MDI        Minimum discrimination information.
ML         Maximum likelihood.
MM         Markov model.
MMg        Markov modeling.
MMI        Maximum mutual information.