Summary: Master's === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 101 === In recent years, with the development of affective computing, emotion recognition has become a critical topic in creating intelligent human-computer interfaces. Speech is one of the most efficient means of human communication. Therefore, to enable machines to communicate with humans more effectively, understanding the information carried by speech, such as emotion and intention, is an important capability. In this thesis, we focus on techniques for detecting emotions in speech.
This thesis proposes an approach to speech emotion recognition using multi-level temporal information. To achieve this goal, Multi-level Unit Chunking is first employed to segment each utterance into emotional units at different temporal levels, and the Hierarchical Correlation Model is then used to integrate the information from those units. For Multi-level Unit Chunking, an edge-detection algorithm locates the boundaries of change and yields the emotional units automatically, as illustrated in the sketch below. Three types of chunking units are determined for each utterance: the basic unit, the sub-emotion unit, and the emotion unit, within which consistent properties are observed in terms of spectral energy, prosodic features, and emotion profiles, respectively.
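As a rough illustration of the chunking step, the following is a minimal sketch of 1-D edge detection over a frame-level feature contour (for example, a spectral-energy contour). The smoothing window, threshold, and minimum boundary gap are illustrative assumptions, not the thesis's exact parameters.

```python
import numpy as np

def detect_unit_boundaries(contour, smooth_win=5, threshold=1.0, min_gap=10):
    """Segment a frame-level feature contour into units by 1-D edge
    detection: frames whose smoothed change exceeds `threshold`
    (in standard deviations) become unit boundaries."""
    # Edge-pad, then moving-average smooth to suppress frame-level noise.
    pad = smooth_win // 2
    padded = np.concatenate([np.full(pad, contour[0]), contour,
                             np.full(pad, contour[-1])])
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.convolve(padded, kernel, mode="valid")

    # The first difference approximates the gradient of the contour.
    grad = np.abs(np.diff(smoothed))
    grad = (grad - grad.mean()) / (grad.std() + 1e-8)
    candidates = np.where(grad > threshold)[0] + 1

    # Merge candidates closer than `min_gap` frames, keeping the first.
    boundaries = [0]
    for b in candidates:
        if b - boundaries[-1] >= min_gap:
            boundaries.append(int(b))
    boundaries.append(len(contour))
    return boundaries

# Example: a synthetic energy contour with two abrupt level changes.
energy = np.concatenate([np.full(50, -2.0), np.full(80, 1.0), np.full(60, -1.0)])
print(detect_unit_boundaries(energy))  # boundaries near frames 50 and 130
```

Consecutive pairs of returned boundaries define the units; running the same procedure on spectral-energy, prosodic, and emotion-profile contours would yield the three chunking levels described above.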
After locating the units at each level, a Hierarchical Correlation Model is proposed to model the hierarchical utterance structure. For each unit, static features are extracted and converted into an emotion profile vector that serves as its soft emotion label. Single-level models are trained on these emotion profile vectors, each weighted by the duration of its corresponding unit. To measure the correlation between units, vector quantization is performed with the k-means clustering algorithm, and each unit's quantized vector is determined by its closest cluster. The correlation is computed statistically and fused with the results from each single-level model; the final decision for the utterance is the emotion with the highest fused score.
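To make the second stage concrete, here is a minimal sketch, assuming each unit's emotion profile is a probability vector over the six EMO-DB emotion classes: profiles are vector-quantized with k-means, per-level scores are duration-weighted averages, and levels are fused by a weighted sum, with the highest fused score giving the utterance label. The codebook size, fusion weights, and emotion-name list are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed six-emotion label set; ordering is illustrative.
EMOTIONS = ["anger", "boredom", "disgust", "fear", "happiness", "sadness"]

def quantize_units(profiles, n_codewords=8, seed=0):
    """Vector-quantize per-unit emotion-profile vectors with k-means;
    each unit is represented by the index of its closest cluster,
    from which unit-to-unit correlation statistics can be counted."""
    km = KMeans(n_clusters=n_codewords, n_init=10, random_state=seed).fit(profiles)
    return km.predict(profiles)

def level_score(profiles, durations):
    """Duration-weighted average of one level's emotion profiles:
    longer units contribute more to the level's per-emotion score."""
    w = np.asarray(durations, dtype=float)
    return (w[:, None] * np.asarray(profiles)).sum(axis=0) / w.sum()

def fuse_levels(level_scores, level_weights):
    """Fuse the per-emotion scores of each temporal level by a
    weighted sum and pick the emotion with the highest fused score."""
    fused = sum(w * s for w, s in zip(level_weights, level_scores))
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: three basic units with toy profiles and durations (frames).
basic = np.array([[0.5, 0.1, 0.1, 0.1, 0.1, 0.1],
                  [0.6, 0.1, 0.05, 0.05, 0.1, 0.1],
                  [0.2, 0.1, 0.1, 0.1, 0.4, 0.1]])
print(quantize_units(basic, n_codewords=2))
s_basic = level_score(basic, durations=[40, 30, 60])
label, fused = fuse_levels([s_basic], level_weights=[1.0])
print(label, fused)
```

In the full model, one such score would be produced per temporal level (basic, sub-emotion, emotion), with the correlation statistics over the quantized codeword indices contributing an additional fused term.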
The proposed approach was evaluated on the Berlin Emotional Speech Database (EMO-DB). The recognition results showed that the proposed speech emotion recognition system achieved 71.69% accuracy, outperforming previous approaches; with speaker normalization, performance reaches 83.55% accuracy on the six-emotion recognition task.