Frequency-dependent auto-pooling function for weakly supervised sound event detection

Abstract Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only cont...

Bibliographic Details
Main Authors: Sichen Liu, Feiran Yang, Yin Cao, Jun Yang
Format: Article
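Language: English
Published: SpringerOpen 2021-05-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
Subjects: Sound event detection; Weakly supervised; Auto-pooling function; Depthwise separable convolution
Online Access: https://doi.org/10.1186/s13636-021-00206-7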
id doaj-cefe15ceed614d479c9df4dd3266080a
record_format Article
spelling doaj-cefe15ceed614d479c9df4dd3266080a
2021-05-23T11:24:01Z
eng
SpringerOpen
EURASIP Journal on Audio, Speech, and Music Processing
1687-4722
2021-05-01
20211111
10.1186/s13636-021-00206-7
Frequency-dependent auto-pooling function for weakly supervised sound event detection
Sichen Liu (0), Feiran Yang (1), Yin Cao (2), Jun Yang (3)
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences
University of Chinese Academy of Sciences
Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey
Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences
Abstract Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets contain only audio tags without precise temporal information. Such a dataset is therefore classified as a weakly labeled dataset. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP performs better than the state-of-the-art source separation-based method in the SED task.
https://doi.org/10.1186/s13636-021-00206-7
Sound event detection
Weakly supervised
Auto-pooling function
Depthwise separable convolution
collection DOAJ
language English
format Article
sources DOAJ
author Sichen Liu
Feiran Yang
Yin Cao
Jun Yang
title Frequency-dependent auto-pooling function for weakly supervised sound event detection
publisher SpringerOpen
series EURASIP Journal on Audio, Speech, and Music Processing
issn 1687-4722
publishDate 2021-05-01
description Abstract Sound event detection (SED), which is typically treated as a supervised problem, aims at detecting the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets contain only audio tags without precise temporal information. Such a dataset is therefore classified as a weakly labeled dataset. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is experimentally shown to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP performs better than the state-of-the-art source separation-based method in the SED task.
topic Sound event detection
Weakly supervised
Auto-pooling function
Depthwise separable convolution
url https://doi.org/10.1186/s13636-021-00206-7