Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture

This paper proposes a separation model adopting a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, only the backbone, not the nested part, uses gated linear units (GLUs) in place of conventional convolutional layers. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer that generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer, and the experimental results show the latter to be better. We evaluated the proposed model against three other models as well as against ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models, and its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model.
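As a rough illustration of the gated convolutions mentioned in the abstract, the PyTorch sketch below shows a minimal GLU-gated 2-D convolution block of the general kind that could replace a conventional convolutional layer in the backbone. The class name GLUConv2d, the channel counts, the kernel size, and the batch-normalization placement are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn

class GLUConv2d(nn.Module):
    """Gated convolution: out = linear_half(x) * sigmoid(gate_half(x))."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces 2 * out_ch channels; nn.GLU splits them
        # into a linear half and a gating half along the channel dimension.
        self.conv = nn.Conv2d(in_ch, 2 * out_ch, kernel_size, padding=pad)
        self.glu = nn.GLU(dim=1)
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.glu(self.conv(x)))

# Example: a batch of 4 single-channel magnitude-spectrogram patches (freq x time).
x = torch.randn(4, 1, 512, 128)
print(GLUConv2d(in_ch=1, out_ch=16)(x).shape)  # torch.Size([4, 16, 512, 128])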


Bibliographic Details
Main Authors: Haibo Geng, Ying Hu, Hao Huang
Format: Article
Language: English
Published: MDPI AG 2020-06-01
Series: Symmetry
Subjects:
singing voice separation
nested U-Net
gated linear units
CNN
monaural source separation
Online Access: https://www.mdpi.com/2073-8994/12/6/1051
id doaj-6c7b139df4ba45208f6f5e05620a5b01
record_format Article
spelling doaj-6c7b139df4ba45208f6f5e05620a5b01 2020-11-25T02:58:50Z eng MDPI AG Symmetry 2073-8994 2020-06-01 12 1051 1051 10.3390/sym12061051 Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture Haibo Geng (0), Ying Hu (1), Hao Huang (2), all three: School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China. This paper proposes a separation model adopting a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, only the backbone, not the nested part, uses gated linear units (GLUs) in place of conventional convolutional layers. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer that generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer, and the experimental results show the latter to be better. We evaluated the proposed model against three other models as well as against ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models, and its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model. https://www.mdpi.com/2073-8994/12/6/1051 singing voice separation; nested U-Net; gated linear units; CNN; monaural source separation
collection DOAJ
language English
format Article
sources DOAJ
author Haibo Geng
Ying Hu
Hao Huang
spellingShingle Haibo Geng
Ying Hu
Hao Huang
Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
Symmetry
singing voice separation
nested U-Net
gated linear units
CNN
monaural source separation
author_facet Haibo Geng
Ying Hu
Hao Huang
author_sort Haibo Geng
title Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
title_short Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
title_full Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
title_fullStr Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
title_full_unstemmed Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture
title_sort monaural singing voice and accompaniment separation based on gated nested u-net architecture
publisher MDPI AG
series Symmetry
issn 2073-8994
publishDate 2020-06-01
description This paper proposes a separation model adopting a gated nested U-Net (GNU-Net) architecture, which is essentially a deeply supervised symmetric encoder–decoder network that can generate full-resolution feature maps. Through a series of nested skip pathways, it reduces the semantic gap between the feature maps of the encoder and decoder subnetworks. In the GNU-Net architecture, only the backbone, not the nested part, uses gated linear units (GLUs) in place of conventional convolutional layers. The outputs of GNU-Net are fed into a time-frequency (T-F) mask layer that generates two masks, one for the singing voice and one for the accompaniment. These two estimated masks, together with the magnitude and phase spectra of the mixture, are then transformed into time-domain signals. We explored two types of T-F mask layer, a discriminative training network and a difference mask layer, and the experimental results show the latter to be better. We evaluated the proposed model against three other models as well as against ideal T-F masks. The results demonstrate that the proposed model outperforms the compared models, and its performance approaches that of the ideal ratio mask (IRM). More importantly, the proposed model outputs the separated singing voice and accompaniment simultaneously, whereas each of the three compared models can separate only one source per trained model.
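To make the masking and resynthesis step in the description above concrete, the sketch below applies two complementary T-F masks to a mixture spectrogram and reconstructs both sources using the mixture phase (PyTorch). The softmax-style construction of the two masks, the STFT parameters, and the function name separate_sources are assumptions for illustration; they are not the paper's exact difference-mask layer.

import torch

def separate_sources(mixture: torch.Tensor, mask_logits: torch.Tensor,
                     n_fft: int = 1024, hop: int = 256):
    """mixture: (batch, samples) waveform; mask_logits: (batch, 2, freq, frames)."""
    window = torch.hann_window(n_fft, device=mixture.device)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    mag, phase = spec.abs(), spec.angle()

    # Two masks that sum to one in every T-F bin: index 0 for the singing
    # voice, index 1 for the accompaniment (an assumed convention).
    masks = torch.softmax(mask_logits, dim=1)

    estimates = []
    for i in range(2):
        est_spec = torch.polar(masks[:, i] * mag, phase)  # reuse the mixture phase
        estimates.append(torch.istft(est_spec, n_fft, hop, window=window,
                                     length=mixture.shape[-1]))
    return estimates[0], estimates[1]  # (voice, accompaniment) waveforms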
topic singing voice separation
nested U-Net
gated linear units
CNN
monaural source separation
url https://www.mdpi.com/2073-8994/12/6/1051
work_keys_str_mv AT haibogeng monauralsingingvoiceandaccompanimentseparationbasedongatednestedunetarchitecture
AT yinghu monauralsingingvoiceandaccompanimentseparationbasedongatednestedunetarchitecture
AT haohuang monauralsingingvoiceandaccompanimentseparationbasedongatednestedunetarchitecture
_version_ 1724704867212066816