Frequency-dependent auto-pooling function for weakly supervised sound event detection

Abstract: Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only cont...

Bibliographic Details
Main Authors:	Sichen Liu, Feiran Yang, Yin Cao, Jun Yang
Format:	Article
Language:	English
Published:	SpringerOpen 2021-05-01
Series:	EURASIP Journal on Audio, Speech, and Music Processing
Subjects:	Sound event detection; Weakly supervised; Auto-pooling function; Depthwise separable convolution
Online Access:	https://doi.org/10.1186/s13636-021-00206-7

id	doaj-cefe15ceed614d479c9df4dd3266080a
record_format	Article
spelling	doaj-cefe15ceed614d479c9df4dd3266080a; 2021-05-23T11:24:01Z; eng; SpringerOpen; EURASIP Journal on Audio, Speech, and Music Processing; 1687-4722; 2021-05-01; 2021111; 10.1186/s13636-021-00206-7; Frequency-dependent auto-pooling function for weakly supervised sound event detection; Sichen Liu (0), Feiran Yang (1), Yin Cao (2), Jun Yang (3); (0) Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; (1) University of Chinese Academy of Sciences; (2) Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; (3) Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; https://doi.org/10.1186/s13636-021-00206-7; Sound event detection; Weakly supervised; Auto-pooling function; Depthwise separable convolution
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Sichen Liu, Feiran Yang, Yin Cao, Jun Yang
author_sort	Sichen Liu
title	Frequency-dependent auto-pooling function for weakly supervised sound event detection
publisher	SpringerOpen
series	EURASIP Journal on Audio, Speech, and Music Processing
issn	1687-4722
publishDate	2021-05-01
description	Abstract: Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only contain audio tags without precise temporal information. This type of dataset is therefore classified as a weakly labeled dataset. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is shown experimentally to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP performs better than the state-of-the-art source separation-based method on the SED task.
topic	Sound event detection; Weakly supervised; Auto-pooling function; Depthwise separable convolution
url	https://doi.org/10.1186/s13636-021-00206-7
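
Illustrative note: the description above says a frequency-dependent auto-pooling (FAP) function turns the estimated T-F mask of a sound event into a clip-level probability, but this record does not give the formula. The sketch below is therefore only an assumption-laden illustration, not the authors' definition. It uses the general auto-pooling idea (a softmax-weighted average whose sharpness is set by a parameter alpha, so that alpha = 0 gives mean pooling and a large alpha approaches max pooling) and assumes one alpha per frequency bin, followed by a plain average over frequency. The names `auto_pool`, `frequency_dependent_auto_pool`, `mask`, and `alpha_f` are hypothetical.

```python
import numpy as np


def auto_pool(p, alpha, axis=-1):
    """Softmax-weighted average of p along `axis`.

    alpha = 0 reduces to the mean; large alpha approaches the max.
    """
    z = alpha * p
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    w = np.exp(z)
    w = w / w.sum(axis=axis, keepdims=True)  # weights sum to 1 along `axis`
    return (w * p).sum(axis=axis)


def frequency_dependent_auto_pool(mask, alpha_f):
    """Pool a (F, T) mask with values in [0, 1] to one clip-level score.

    Assumption: each frequency bin f has its own sharpness alpha_f[f];
    time is pooled per bin, then the per-bin scores are averaged.
    """
    per_bin = auto_pool(mask, alpha_f[:, None], axis=1)  # shape (F,)
    return float(per_bin.mean())


# Toy mask with 2 frequency bins and 2 time frames.
mask = np.array([[0.1, 0.9],
                 [0.2, 0.4]])
mean_like = frequency_dependent_auto_pool(mask, np.zeros(2))
max_like = frequency_dependent_auto_pool(mask, np.full(2, 1000.0))
```

In a trained model the per-bin alpha values would be learned together with the network weights; here they are fixed by hand only to show the two limiting behaviours.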

Similar Items