Summary: | PhD === National Taiwan University === Institute of Applied Mechanics === 99 === The objective of this study is to develop an automatic speech emotion recognition method based on Bayesian networks. By computing relevant features of emotional speech and comparing them against an emotion database, the speaker's emotional state can be identified.
First, we compute statistical features of pitch, frame energy, formants, and mel-frequency cepstral coefficients (MFCCs). We then use the mean value of the neutral emotion in the corpus as a normalization factor for each feature and compute normalized pitch, frame energy, and formant features. Normalization reduces the feature differences between speakers.
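As a concrete illustration of the normalization step, the sketch below divides each utterance-level feature by the corpus-wide neutral-emotion mean. This abstract does not give the exact normalization formula, so the division, the feature layout, and all numbers here are assumptions:

```python
import numpy as np

def normalize_by_neutral(features, neutral_mean):
    """Scale each statistical feature by the corpus neutral-emotion mean.

    features:     (n_utterances, n_features) array of per-utterance
                  statistics (e.g. pitch mean, energy mean, F1 mean)
    neutral_mean: (n_features,) mean of each feature over the
                  neutral-emotion utterances in the training corpus
    """
    return features / neutral_mean

# Hypothetical per-utterance features: [pitch mean (Hz), energy mean, F1 mean (Hz)]
raw = np.array([[210.0, 0.42, 710.0],
                [180.0, 0.35, 650.0]])
neutral_mean = np.array([190.0, 0.38, 680.0])  # assumed corpus statistics
normalized = normalize_by_neutral(raw, neutral_mean)
print(normalized)  # values near 1.0 mean the feature is close to neutral
```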
Each feature has a different discriminative ability for emotion recognition. For example, the normalized pitch mean can separate sadness and neutrality, while happiness and anger fall into a single cluster. No feature can clearly distinguish all four emotions on its own, so we recognize them layer by layer using different clusterings, as sketched below. We group features with similar discriminative ability and establish the Multi-Layered Bayesian Network (MLBN) method for speech emotion recognition: features in layer 1 distinguish two clusters of emotions, features in layer 2 distinguish three clusters, and features with no obvious clusters are placed in layer 3 to distinguish all four emotions.
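The layer-by-layer decision can be sketched as follows. This is a simplified rendering, not the thesis's actual network: it uses one hypothetical feature per layer, illustrative Gaussian parameters, and a hard intersection of clusters, whereas the real MLBN combines several features per layer:

```python
import numpy as np
from scipy.stats import norm

EMOTIONS = {"happy", "angry", "sad", "neutral"}

# Hypothetical MLBN tables: each layer pairs a feature index with a list
# of clusters, each cluster being (emotion set, Gaussian mean, std) as
# learned from training data.  All numbers are illustrative.
LAYERS = [
    # layer 1: normalized pitch mean separates two broad clusters
    (0, [({"happy", "angry"}, 1.25, 0.10),
         ({"sad", "neutral"}, 0.95, 0.08)]),
    # layer 2: normalized energy mean separates three clusters
    (1, [({"angry"}, 1.40, 0.12),
         ({"happy"}, 1.15, 0.10),
         ({"sad", "neutral"}, 0.90, 0.09)]),
    # layer 3: an MFCC statistic separates all four emotions
    (2, [({"angry"}, 0.80, 0.15), ({"happy"}, 0.55, 0.15),
         ({"neutral"}, 0.30, 0.15), ({"sad"}, 0.10, 0.15)]),
]

def mlbn_classify(x):
    """Narrow the candidate emotions layer by layer: at each layer keep
    the cluster with the highest Gaussian likelihood, intersected with
    the survivors of the previous layers."""
    candidates = set(EMOTIONS)
    for feat_idx, clusters in LAYERS:
        best = max(clusters, key=lambda c: norm.pdf(x[feat_idx], c[1], c[2]))
        narrowed = candidates & best[0]
        if narrowed:                  # ignore a layer that contradicts
            candidates = narrowed     # the earlier, coarser layers
        if len(candidates) == 1:
            break
    return candidates

print(mlbn_classify(np.array([1.22, 1.38, 0.78])))  # -> {'angry'}
```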
The features are also correlated with one another. We therefore extend MLBN to the Multi-Layered Bayesian Network with Covariance (MLBNC) method, which accounts for the correlations between features in speech emotion recognition.
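A minimal sketch of what the covariance extension changes: each cluster is scored with a full-covariance multivariate Gaussian rather than independent per-feature Gaussians, so that, for example, pitch and energy rising together can be modeled. The cluster parameters below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_log_likelihood(x, mean, cov):
    """Log-likelihood of feature vector x under a cluster modeled as a
    full-covariance multivariate Gaussian, capturing correlations
    between features instead of treating them independently."""
    return multivariate_normal.logpdf(x, mean=mean, cov=cov)

# Hypothetical two-feature cluster for "angry": positive covariance
# between normalized pitch mean and normalized energy mean.
mean = np.array([1.30, 1.40])
cov = np.array([[0.010, 0.006],
                [0.006, 0.014]])
x = np.array([1.22, 1.38])
print(cluster_log_likelihood(x, mean, cov))
```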
The recognition rate is poor when the recognizer's training data do not include the speaker's emotional speech. We therefore propose adaptive MLBN and MLBNC methods: whenever a recognition result is wrong, the adaptation process adjusts the means and standard deviations (or covariances) of the clusters in the MLBN or MLBNC database to fit the speaker's actual emotional state.
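This abstract does not state the exact update formula, so the sketch below uses an assumed convex-combination update, in the spirit of MAP adaptation, to shift a cluster's statistics toward the misrecognized utterance:

```python
import numpy as np

def adapt_cluster(mean, cov, x, rate=0.1):
    """After a wrong recognition result, pull the true emotion cluster's
    mean and covariance toward the speaker's feature vector x.  The
    convex-combination rule and the learning rate are assumptions made
    for illustration, not the thesis's exact formula."""
    d = x - mean
    new_mean = mean + rate * d
    new_cov = (1.0 - rate) * cov + rate * np.outer(d, d)  # stays PSD
    return new_mean, new_cov

# Hypothetical adaptation of the "angry" cluster from the MLBNC sketch
mean = np.array([1.30, 1.40])
cov = np.array([[0.010, 0.006],
                [0.006, 0.014]])
mean, cov = adapt_cluster(mean, cov, x=np.array([1.05, 1.10]))
```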
To verify the proposed methods, we use the German emotional speech database (EMO-DB) as training and testing data for inside and outside tests of the KNN, SVM, MLBN, and MLBNC recognizers. For the cross-corpus test, we train on EMO-DB and test on the ITRI emotional database. In the adaptive tests, we train on EMO-DB and use the ITRI database as both adaptation and testing data for the adaptive KNN, MLBN, and MLBNC recognizers.
The inside-test recognition rates of MLBN, MLBNC, and Bayesian Decision (BD) are 81.1%, 88.8%, and 70.8%, respectively, showing that clustering features layer by layer effectively increases the recognition rate, and that accounting for the correlations between features improves it further. In the outside test, the recognition rates of KNN, SVM, and MLBN are 78.2%, 89.1%, and 69.9% with the original features, and 82.6%, 91.7%, and 77.6% with the normalized features, confirming that normalized features reduce the differences between speakers and increase the recognition rate. When the testing corpus differs from the training corpus, the recognition rates of KNN, SVM, MLBN, and MLBNC are 34.21%, 46.92%, 39.33%, and 52.08%, respectively: every recognizer performs poorly when the speaker's pronunciation or emotional expression differs from the training data.
In the adaptive emotion recognition tests, the adaptive KNN method raises the recognition rate from 34.2% to 73.7%, adaptive MLBN from 37.8% to 82.4%, and adaptive MLBNC from 51.6% to 81.2%; the proposed adaptive MLBN and MLBNC methods outperform adaptive KNN. As the number of adjustments increases, the recognition rate of MLBN rises from 39.3% to 88.9% and that of MLBNC from 52.1% to 90.0%, showing that the adaptive MLBN and MLBNC methods can track the speaker's actual emotional state and achieve good recognition results after appropriate adjustment.
|