Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets

Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model fo...

Bibliographic Details
Main Authors:	Kyoung Ju Noh, Chi Yoon Jeong, Jiyoun Lim, Seungeun Chung, Gague Kim, Jeong Mook Lim, Hyuntae Jeong
Format:	Article
Language:	English
Published:	MDPI AG 2021-02-01
Series:	Sensors
Subjects:	speech emotion recognition; domain adaptation; SER generalization; Korean Emotional Speech Database; ensemble model; multi-path
Online Access:	https://www.mdpi.com/1424-8220/21/5/1579

id	doaj-a4a9ddbb46124a6981d30824bd652772
record_format	Article
spelling	doaj-a4a9ddbb46124a6981d30824bd652772; 2021-02-25T00:03:18Z; eng; MDPI AG; Sensors; 1424-8220; 2021-02-01; 21; 1579; 1579; 10.3390/s21051579; Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets; Kyoung Ju Noh, Chi Yoon Jeong, Jiyoun Lim, Seungeun Chung, Gague Kim, Jeong Mook Lim, Hyuntae Jeong; Artificial Intelligence Research Lab., Electronics and Telecommunications Research Institute, Daejeon 34129, Korea
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Kyoung Ju Noh, Chi Yoon Jeong, Jiyoun Lim, Seungeun Chung, Gague Kim, Jeong Mook Lim, Hyuntae Jeong
author_sort	Kyoung Ju Noh
title	Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets
publisher	MDPI AG
series	Sensors
issn	1424-8220
publishDate	2021-02-01
description	Speech emotion recognition (SER) is a natural method of recognizing individual emotions in everyday life. To distribute SER models to real-world applications, some key challenges must be overcome, such as the lack of datasets tagged with emotion labels and the weak generalization of the SER model for an unseen target domain. This study proposes a multi-path and group-loss-based network (MPGLN) for SER to support multi-domain adaptation. The proposed model includes a bidirectional long short-term memory-based temporal feature generator and a transferred feature extractor from the pre-trained VGG-like audio classification model (VGGish), and it learns simultaneously based on multiple losses according to the association of emotion labels in the discrete and dimensional models. For the evaluation of the MPGLN SER as applied to multi-cultural domain datasets, the Korean Emotional Speech Database (KESD), including KESDy18 and KESDy19, is constructed, and the English-speaking Interactive Emotional Dyadic Motion Capture database (IEMOCAP) is used. The evaluation of multi-domain adaptation and domain generalization showed 3.7% and 3.5% improvements, respectively, of the F1 score when comparing the performance of MPGLN SER with a baseline SER model that uses a temporal feature generator. We show that the MPGLN SER efficiently supports multi-domain adaptation and reinforces model generalization.
topic	speech emotion recognition; domain adaptation; SER generalization; Korean Emotional Speech Database; ensemble model; multi-path
url	https://www.mdpi.com/1424-8220/21/5/1579
_version_	1724252322595340288