Frequency-dependent auto-pooling function for weakly supervised sound event detection
Abstract: Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only cont...
Main Authors: | Sichen Liu, Feiran Yang, Yin Cao, Jun Yang |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen, 2021-05-01 |
Series: | EURASIP Journal on Audio, Speech, and Music Processing |
Subjects: | Sound event detection, Weakly supervised, Auto-pooling function, Depthwise separable convolution |
Online Access: | https://doi.org/10.1186/s13636-021-00206-7 |
id | doaj-cefe15ceed614d479c9df4dd3266080a |
---|---|
record_format | Article |
spelling | doaj-cefe15ceed614d479c9df4dd3266080a; 2021-05-23T11:24:01Z; eng; SpringerOpen; EURASIP Journal on Audio, Speech, and Music Processing; 1687-4722; 2021-05-01; 20211111; 10.1186/s13636-021-00206-7; Frequency-dependent auto-pooling function for weakly supervised sound event detection; Sichen Liu 0; Feiran Yang 1; Yin Cao 2; Jun Yang 3; Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey; Key Laboratory of Noise and Vibration Research, Institute of Acoustics, Chinese Academy of Sciences; Abstract: Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only contain audio tags without precise temporal information. Such datasets are therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is shown experimentally to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task.; https://doi.org/10.1186/s13636-021-00206-7; Sound event detection; Weakly supervised; Auto-pooling function; Depthwise separable convolution |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Sichen Liu, Feiran Yang, Yin Cao, Jun Yang |
spellingShingle | Sichen Liu; Feiran Yang; Yin Cao; Jun Yang; Frequency-dependent auto-pooling function for weakly supervised sound event detection; EURASIP Journal on Audio, Speech, and Music Processing; Sound event detection; Weakly supervised; Auto-pooling function; Depthwise separable convolution |
author_facet | Sichen Liu, Feiran Yang, Yin Cao, Jun Yang |
author_sort | Sichen Liu |
title | Frequency-dependent auto-pooling function for weakly supervised sound event detection |
title_short | Frequency-dependent auto-pooling function for weakly supervised sound event detection |
title_full | Frequency-dependent auto-pooling function for weakly supervised sound event detection |
title_fullStr | Frequency-dependent auto-pooling function for weakly supervised sound event detection |
title_full_unstemmed | Frequency-dependent auto-pooling function for weakly supervised sound event detection |
title_sort | frequency-dependent auto-pooling function for weakly supervised sound event detection |
publisher | SpringerOpen |
series | EURASIP Journal on Audio, Speech, and Music Processing |
issn | 1687-4722 |
publishDate | 2021-05-01 |
description | Abstract: Sound event detection (SED), which is typically treated as a supervised problem, aims to detect the types of sound events and their corresponding temporal information. It requires estimating onset and offset annotations for sound events at each frame. Many available sound event datasets only contain audio tags without precise temporal information. Such datasets are therefore classified as weakly labeled. In this paper, we propose a novel source separation-based method trained on weakly labeled data to solve SED problems. We build a dilated depthwise separable convolution block (DDC-block) to estimate time-frequency (T-F) masks of each sound event from a T-F representation of an audio clip. The DDC-block is shown experimentally to be more effective and computationally lighter than a “VGG-like” block. To fully utilize the frequency characteristics of sound events, we then propose a frequency-dependent auto-pooling (FAP) function to obtain the clip-level presence probability of each sound event class. The combination of the two schemes, named the DDC-FAP method, is evaluated on the DCASE 2018 Task 2, DCASE 2020 Task 4, and DCASE 2017 Task 4 datasets. The results show that DDC-FAP outperforms the state-of-the-art source separation-based method on the SED task. |
topic | Sound event detection, Weakly supervised, Auto-pooling function, Depthwise separable convolution |
url | https://doi.org/10.1186/s13636-021-00206-7 |
work_keys_str_mv | AT sichenliu frequencydependentautopoolingfunctionforweaklysupervisedsoundeventdetection AT feiranyang frequencydependentautopoolingfunctionforweaklysupervisedsoundeventdetection AT yincao frequencydependentautopoolingfunctionforweaklysupervisedsoundeventdetection AT junyang frequencydependentautopoolingfunctionforweaklysupervisedsoundeventdetection |
_version_ | 1721429886457872384 |
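
The abstract names a frequency-dependent auto-pooling (FAP) function but this record does not give its formula. As a rough illustration only, the sketch below shows the general idea of auto-pooling as a softmax-weighted average with a sharpness parameter, and then one hypothetical way to make it frequency dependent by using a separate parameter per frequency bin. The function names, the per-bin parameter list `alphas`, the final averaging over bins, and the example values are all assumptions made for this sketch; they are not taken from the paper, where such parameters would presumably be learned rather than fixed.

```python
import math


def auto_pool(probs, alpha):
    """Softmax-weighted average of `probs` with sharpness `alpha`.

    alpha = 0 reduces to the mean; a large alpha approaches the max.
    """
    peak = max(alpha * p for p in probs)  # subtracted for numerical stability
    weights = [math.exp(alpha * p - peak) for p in probs]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total


def frequency_dependent_pool(mask, alphas):
    """Hypothetical frequency-dependent variant (an assumption, not the paper's formula).

    mask[f][t] is the mask value of one event class at frequency bin f and frame t.
    alphas[f] is the sharpness used for bin f. Each bin is pooled over time,
    then the per-bin results are averaged into a single clip-level score.
    """
    per_bin = [auto_pool(row, a) for row, a in zip(mask, alphas)]
    return sum(per_bin) / len(per_bin)


# Illustrative mask for one class: 2 frequency bins x 4 frames (made-up values).
mask = [
    [0.1, 0.9, 0.2, 0.1],  # bin 0: brief, strong activation
    [0.5, 0.6, 0.5, 0.6],  # bin 1: sustained, moderate activation
]
clip_score = frequency_dependent_pool(mask, alphas=[5.0, 0.0])
```

With a larger sharpness, bin 0 is pooled closer to its peak, while bin 1 with sharpness 0 is simply averaged; letting the sharpness differ per bin is what makes the pooling frequency dependent in this sketch.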