Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network

Speech emotion recognition (SER) is a vital and challenging task in which feature extraction plays a significant role in overall performance. With the development of deep learning, we focus on end-to-end structures and validate that the algorithm is extraordinarily effective. In this paper,...


Bibliographic Details
Main Authors: Hao Meng, Tianhao Yan, Fei Yuan, Hongwei Wei
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: 3-D Log-Mel; dilated CNN; residual block; center loss; BiLSTM; attention mechanism
Online Access: https://ieeexplore.ieee.org/document/8817913/
id doaj-5c35727478634da39940ae8c067c1dee
record_format Article
spelling doaj-5c35727478634da39940ae8c067c1dee
last_updated 2021-03-29T23:20:36Z
journal IEEE Access, vol. 7 (2019), pp. 125868-125881
issn 2169-3536
doi 10.1109/ACCESS.2019.2938007
article_number 8817913
author_details Hao Meng; Tianhao Yan (https://orcid.org/0000-0003-1851-6075); Fei Yuan (https://orcid.org/0000-0003-3985-0726); Hongwei Wei
affiliation College of Automation, Institute of Robotics and Intelligent Control, Harbin Engineering University, Harbin, China (all four authors)
collection DOAJ
language English
format Article
sources DOAJ
author Hao Meng
Tianhao Yan
Fei Yuan
Hongwei Wei
spellingShingle Hao Meng
Tianhao Yan
Fei Yuan
Hongwei Wei
Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
IEEE Access
3-D Log-Mel
dilated CNN
residual block
center loss
BiLSTM
attention mechanism
author_facet Hao Meng
Tianhao Yan
Fei Yuan
Hongwei Wei
author_sort Hao Meng
title Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
title_short Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
title_full Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
title_fullStr Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
title_full_unstemmed Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network
title_sort speech emotion recognition from 3d log-mel spectrograms with deep learning network
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Speech emotion recognition (SER) is a vital and challenging task in which feature extraction plays a significant role in overall performance. With the development of deep learning, we focus on end-to-end structures and validate that the algorithm is extraordinarily effective. In this paper, we introduce a novel architecture, ADRNN (dilated CNN with residual block and BiLSTM based on the attention mechanism), for speech emotion recognition; it takes advantage of the strengths of diverse networks, overcomes the shortcomings of using each alone, and is evaluated on the popular IEMOCAP database and the Berlin EMODB corpus. The dilated CNN helps the model acquire a larger receptive field than pooling layers do, skip connections preserve more information from the shallow layers, and a BiLSTM layer is adopted to learn long-term dependencies from the learned local features. We utilize an attention mechanism to further enhance the extraction of speech features. Furthermore, we improve the loss function by applying softmax together with center loss, which achieves better classification performance. Emotional dialogues are transformed into spectrograms: we extract 3-D Log-Mel spectrum values from the raw signals, feed them into the proposed algorithm, and obtain notable performance of 74.96% unweighted accuracy in the speaker-dependent experiment and 69.32% unweighted accuracy in the speaker-independent experiment, surpassing the 64.74% of previous state-of-the-art methods on the spontaneous emotional speech of the IEMOCAP database. In addition, the proposed networks achieve recognition accuracies of 90.78% and 85.39% on Berlin EMODB in the speaker-dependent and speaker-independent experiments, respectively, better than the 88.30% and 82.82% obtained by previous work. To validate robustness and generalization, we also conduct a cross-corpus experiment between the above databases and achieve a final recognition accuracy of 63.84%.
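As a concrete illustration of the 3-D Log-Mel features the description mentions, the following Python sketch uses the common convention of stacking the static log-Mel spectrogram with its delta and delta-delta coefficients as three channels. The function name log_mel_3d and all parameter values (16 kHz sampling, 40 Mel bands, 25 ms windows with a 10 ms hop) are illustrative assumptions, not settings confirmed by the article.

# Hedged sketch: one common reading of "3-D Log-Mel" features, i.e. the
# static log-Mel spectrogram stacked with its delta and delta-delta
# coefficients as three input channels. All parameters are assumptions.
import numpy as np
import librosa

def log_mel_3d(wav_path, sr=16000, n_mels=40):
    """Return a (3, n_mels, n_frames) array: static, delta, delta-delta."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                 # static channel
    delta = librosa.feature.delta(log_mel)             # first temporal derivative
    delta2 = librosa.feature.delta(log_mel, order=2)   # second temporal derivative
    return np.stack([log_mel, delta, delta2], axis=0)

Similarly, the softmax-plus-center-loss objective the description refers to can be sketched as below. This is one standard formulation of center loss, not the authors' exact implementation; the class count, feature dimension, and weighting factor lam are hypothetical.

import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalizes the distance between each feature vector and its class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        # one learnable center per emotion class (assumed dimensions)
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # mean squared distance to the center of each sample's class
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

cross_entropy = nn.CrossEntropyLoss()
center_loss = CenterLoss(num_classes=4, feat_dim=128)  # hypothetical sizes

def joint_loss(logits, features, labels, lam=0.1):
    # L = L_softmax + lam * L_center, per the improved loss the abstract describes
    return cross_entropy(logits, labels) + lam * center_loss(features, labels)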
topic 3-D Log-Mel
dilated CNN
residual block
center loss
BiLSTM
attention mechanism
url https://ieeexplore.ieee.org/document/8817913/
work_keys_str_mv AT haomeng speechemotionrecognitionfrom3dlogmelspectrogramswithdeeplearningnetwork
AT tianhaoyan speechemotionrecognitionfrom3dlogmelspectrogramswithdeeplearningnetwork
AT feiyuan speechemotionrecognitionfrom3dlogmelspectrogramswithdeeplearningnetwork
AT hongweiwei speechemotionrecognitionfrom3dlogmelspectrogramswithdeeplearningnetwork
_version_ 1724189643182702592