Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
This paper proposes a separation model adopting gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it can reduce the semantic gap between the f...
Main Authors: | Haibo Geng, Ying Hu, Hao Huang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2020-06-01 |
Series: | Symmetry |
Subjects: | singing voice separation, nested U-Net, gated linear units, CNN, monaural source separation |
Online Access: | https://www.mdpi.com/2073-8994/12/6/1051 |
id |
doaj-6c7b139df4ba45208f6f5e05620a5b01 |
---|---|
record_format |
Article |
spelling |
doaj-6c7b139df4ba45208f6f5e05620a5b01 2020-11-25T02:58:50Z eng MDPI AG Symmetry 2073-8994 2020-06-01 12 1051 1051 10.3390/sym12061051 Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture Haibo Geng 0 Ying Hu 1 Hao Huang 2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China This paper proposes a separation model based on a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised, symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, gated linear units (GLUs) are used instead of conventional convolutional networks only in the backbone, excluding the nested part. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer that generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer; the experimental results show that the latter performs better. We evaluated the proposed model by comparing it with three other models and with ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models and that its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model. https://www.mdpi.com/2073-8994/12/6/1051 singing voice separation nested U-Net gated linear units CNN monaural source separation |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Haibo Geng Ying Hu Hao Huang |
spellingShingle |
Haibo Geng Ying Hu Hao Huang Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture Symmetry singing voice separation nested U-Net gated linear units CNN monaural source separation |
author_facet |
Haibo Geng Ying Hu Hao Huang |
author_sort |
Haibo Geng |
title |
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture |
title_short |
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture |
title_full |
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture |
title_fullStr |
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture |
title_full_unstemmed |
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture |
title_sort |
monaural singing voice and accompaniment separation based on gated nested u-net architecture |
publisher |
MDPI AG |
series |
Symmetry |
issn |
2073-8994 |
publishDate |
2020-06-01 |
description |
This paper proposes a separation model based on a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised, symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, gated linear units (GLUs) are used instead of conventional convolutional networks only in the backbone, excluding the nested part. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer that generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer; the experimental results show that the latter performs better. We evaluated the proposed model by comparing it with three other models and with ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models and that its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model. |
topic |
singing voice separation nested U-Net gated linear units CNN monaural source separation |
url |
https://www.mdpi.com/2073-8994/12/6/1051 |
work_keys_str_mv |
AT haibogeng monauralsingingvoiceandaccompanimentseparationbasedongatednestedunetarchitecture AT yinghu monauralsingingvoiceandaccompanimentseparationbasedongatednestedunetarchitecture AT haohuang monauralsingingvoiceandaccompanimentseparationbasedongatednestedunetarchitecture |
_version_ |
1724704867212066816 |
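
The description field mentions two ideas that are easier to see in code: the ideal ratio mask (IRM) used as a reference, and the step in which a mask is combined with the mixture's magnitude and phase spectra. The following is a minimal sketch of those two steps only, not the authors' implementation. It assumes the magnitude-ratio form of the IRM (other definitions exist), leaves out the STFT, the inverse STFT, and the network itself, and uses hypothetical function names and toy values.

```python
import cmath

EPS = 1e-8  # assumption: small constant to avoid division by zero in silent bins


def ideal_ratio_masks(voice_mag, accomp_mag):
    """Return (voice_mask, accomp_mask) from the sources' magnitude spectrograms.

    Assumes the magnitude-ratio form: voice / (voice + accompaniment).
    """
    voice_mask, accomp_mask = [], []
    for v_row, a_row in zip(voice_mag, accomp_mag):
        v_out, a_out = [], []
        for v, a in zip(v_row, a_row):
            m = v / (v + a + EPS)
            v_out.append(m)
            a_out.append(1.0 - m)  # the two masks sum to one in every bin
        voice_mask.append(v_out)
        accomp_mask.append(a_out)
    return voice_mask, accomp_mask


def apply_mask(mask, mixture_spec):
    """Scale the mixture magnitude by the mask and reuse the mixture phase."""
    return [
        [cmath.rect(m * abs(x), cmath.phase(x)) for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_spec)
    ]


# Toy 2x2 "spectrograms" (rows are frequency bins, columns are frames).
voice_mag = [[3.0, 0.0], [1.0, 2.0]]
accomp_mag = [[1.0, 2.0], [1.0, 0.0]]
mixture_spec = [[4 + 0j, 0 + 2j], [-2 + 0j, 1 + 1j]]

voice_mask, accomp_mask = ideal_ratio_masks(voice_mag, accomp_mag)
voice_est = apply_mask(voice_mask, mixture_spec)
accomp_est = apply_mask(accomp_mask, mixture_spec)
```

Because the two masks are complementary, the two masked spectrograms add back up to the mixture spectrogram. An inverse STFT of each estimate would then give the time-domain signals.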