Summary: | Spontaneous speech emotion recognition is a challenging and relatively new research topic. In this paper, we propose a new method for spontaneous speech emotion recognition based on binaural representations and deep convolutional neural networks (CNNs). The proposed method first employs multiple CNNs to learn deep segment-level binaural representations, such as Left-Right and Mid-Side pairs, from extracted image-like Mel-spectrograms. These CNNs are fine-tuned on the target emotional speech datasets starting from a pre-trained image CNN model. A new feature pooling strategy, called block-based temporal feature pooling, is then proposed to aggregate the learned segment-level features into fixed-length utterance-level features. Based on the utterance-level features, a linear support vector machine (SVM) is adopted for emotion classification. Finally, a two-stage score-level fusion strategy integrates the results obtained from the Left-Right and Mid-Side pairs. Extensive experiments on two challenging spontaneous emotional speech datasets, AFEW5.0 and BAUM-1s, demonstrate the effectiveness of the proposed method.
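Two of the steps above are easy to make concrete. The Mid-Side pair is the standard sum/difference transform of the stereo (Left-Right) channels, and block-based temporal pooling maps a variable number of segment-level CNN feature vectors to one fixed-length utterance-level vector. The Python sketch below illustrates both under stated assumptions: the abstract does not specify the authors' exact pooling functions or block count, so the mean/max pooling within each block and the `n_blocks` parameter here are illustrative choices, not the paper's specification.

```python
import numpy as np


def mid_side(left: np.ndarray, right: np.ndarray):
    """Convert a Left-Right stereo pair to a Mid-Side pair.

    Uses the common convention Mid = (L + R) / 2, Side = (L - R) / 2.
    """
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side


def block_temporal_pooling(segment_feats: np.ndarray, n_blocks: int = 4) -> np.ndarray:
    """Aggregate segment-level features into a fixed-length utterance vector.

    segment_feats: (n_segments, feat_dim) array of deep features, one row per
    Mel-spectrogram segment. The segments are split into `n_blocks` contiguous
    temporal blocks; each block is mean- and max-pooled, and the pooled vectors
    are concatenated, yielding a vector of size 2 * n_blocks * feat_dim
    regardless of the utterance length.
    """
    assert segment_feats.shape[0] >= n_blocks, "need at least one segment per block"
    pooled = []
    for block in np.array_split(segment_feats, n_blocks, axis=0):
        pooled.append(block.mean(axis=0))  # average statistics within the block
        pooled.append(block.max(axis=0))   # peak statistics within the block
    return np.concatenate(pooled)


# Example: 37 segments with 512-dim CNN features -> one 4096-dim vector
# (2 poolings x 4 blocks x 512 dims), suitable as input to a linear SVM.
feats = np.random.randn(37, 512)
utterance_vec = block_temporal_pooling(feats, n_blocks=4)
print(utterance_vec.shape)  # (4096,)
```

The fixed output dimensionality is what makes a linear SVM applicable afterwards: utterances of different durations all map to vectors of the same length.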