Summary: | To address the imbalance of data samples across modalities in multi-modal data, knowledge from the resource-rich text modality was used to model the resource-poor acoustic modality, and an emotion recognition neural network was constructed in which the similarity between the two modalities provides auxiliary supervision during training. First, a neural network with a bidirectional GRU (bi-GRU) at its core was used to learn the initial feature vectors of the text and acoustic modalities. Second, the softmax function was used to predict the emotion class, while a fully connected layer generated the target feature vectors corresponding to the two modalities. Finally, the similarity between the two target feature vectors was computed and used to assist the supervised training, improving emotion recognition performance. The results show that, on four-class emotion classification on the IEMOCAP dataset, this neural network achieves a weighted accuracy of 82.6% and an unweighted accuracy of 81.3%. The research provides a reference and a methodological basis for emotion recognition and auxiliary modeling in the multi-modal field of artificial intelligence.
|
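The similarity-assisted training objective described in the summary could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the summary does not state which similarity measure or loss weighting was used, so cosine similarity, the additive combination, and the names `joint_loss` and `sim_weight` are all assumptions made here for illustration. The logits and target feature vectors stand in for the outputs of the bi-GRU encoders and the fully connected layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability of the true emotion class."""
    return -math.log(softmax(logits)[label])


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def joint_loss(text_logits, audio_logits, text_target, audio_target,
               label, sim_weight=0.5):
    """Classification loss for both modalities plus a similarity penalty.

    sim_weight is a hypothetical hyperparameter; the penalty is zero when
    the two target feature vectors point in the same direction.
    """
    ce = cross_entropy(text_logits, label) + cross_entropy(audio_logits, label)
    sim_penalty = 1.0 - cosine_similarity(text_target, audio_target)
    return ce + sim_weight * sim_penalty


if __name__ == "__main__":
    # Four emotion classes, true class index 0; values are made up.
    text_logits = [2.0, 0.1, 0.1, 0.1]
    audio_logits = [1.5, 0.2, 0.1, 0.0]
    print(joint_loss(text_logits, audio_logits,
                     [1.0, 2.0, 3.0], [0.9, 2.1, 2.8], label=0))
```

Under these assumptions, minimizing the penalty term pulls the acoustic target feature vector toward the text one, which is one plausible way the text modality could guide the acoustic modality during training.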