Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques

We propose a learning-based method of estimating the compatibility between vocal and accompaniment audio tracks, <italic>i.e.</italic>, how well they go with each other when played simultaneously. This task is challenging because it is difficult to formulate hand-crafted rules or construct a large labeled dataset to perform supervised learning. Our method uses self-supervised and joint-embedding techniques for estimating vocal-accompaniment compatibility. We train vocal and accompaniment encoders to learn a joint-embedding space of vocal and accompaniment tracks, where the embedded feature vectors of a compatible pair of vocal and accompaniment tracks lie close to each other and those of an incompatible pair lie far from each other. To address the lack of large labeled datasets consisting of compatible and incompatible pairs of vocal and accompaniment tracks, we propose generating such a dataset from songs using singing voice separation techniques, with which songs are separated into pairs of vocal and accompaniment tracks, and then original pairs are assumed to be compatible, and other random pairs are not. We achieved this training by constructing a large dataset containing 910,803 songs and evaluated the effectiveness of our method using ranking-based evaluation methods.
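The joint-embedding idea above — separate vocal and accompaniment encoders trained so that compatible pairs embed close together and incompatible pairs far apart — is a form of metric learning. The paper does not spell out its encoder architecture or loss here, so the linear toy encoders, batch size, temperature, and InfoNCE-style contrastive loss below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear encoder followed by L2 normalization."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Hypothetical batch: 4 vocal/accompaniment feature pairs,
# where row i of each matrix comes from the same song.
vocals = rng.normal(size=(4, 16))
accomps = rng.normal(size=(4, 16))

W_v = rng.normal(size=(16, 8))  # vocal-encoder weights
W_a = rng.normal(size=(16, 8))  # accompaniment-encoder weights

z_v = encode(vocals, W_v)
z_a = encode(accomps, W_a)

# Cosine-similarity matrix; diagonal entries are the compatible pairs.
sim = z_v @ z_a.T
tau = 0.1  # temperature

# InfoNCE-style objective: each vocal should match its own accompaniment
# against the other (incompatible) accompaniments in the batch.
logits = sim / tau
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))  # scalar to minimize by gradient descent
```

Minimizing such a loss pulls the diagonal (compatible) similarities up and pushes the off-diagonal (incompatible) ones down, which is exactly the geometry the abstract describes.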


Bibliographic Details
Main Authors: Takayuki Nakatsuka, Kento Watanabe, Yuki Koyama, Masahiro Hamasaki, Masataka Goto, Shigeo Morishima
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/9481947/
id doaj-2a5457fc8bf845a89e22beb25c6afb9c
record_format Article
spelling doaj-2a5457fc8bf845a89e22beb25c6afb9c 2021-07-26T23:01:32Z eng
IEEE, IEEE Access, ISSN 2169-3536, published 2021-01-01, Vol. 9, pp. 101994-102003, DOI 10.1109/ACCESS.2021.3096819, IEEE document 9481947
Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
Takayuki Nakatsuka (https://orcid.org/0000-0003-3181-4894), National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Kento Watanabe, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Yuki Koyama (https://orcid.org/0000-0002-3978-1444), National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Masahiro Hamasaki (https://orcid.org/0000-0003-3085-7446), National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Masataka Goto, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
Shigeo Morishima (https://orcid.org/0000-0001-8859-6539), Waseda University, Shinjuku, Tokyo, Japan
We propose a learning-based method of estimating the compatibility between vocal and accompaniment audio tracks, <italic>i.e.</italic>, how well they go with each other when played simultaneously. This task is challenging because it is difficult to formulate hand-crafted rules or construct a large labeled dataset to perform supervised learning. Our method uses self-supervised and joint-embedding techniques for estimating vocal-accompaniment compatibility. We train vocal and accompaniment encoders to learn a joint-embedding space of vocal and accompaniment tracks, where the embedded feature vectors of a compatible pair of vocal and accompaniment tracks lie close to each other and those of an incompatible pair lie far from each other. To address the lack of large labeled datasets consisting of compatible and incompatible pairs of vocal and accompaniment tracks, we propose generating such a dataset from songs using singing voice separation techniques, with which songs are separated into pairs of vocal and accompaniment tracks, and then original pairs are assumed to be compatible, and other random pairs are not. We achieved this training by constructing a large dataset containing 910,803 songs and evaluated the effectiveness of our method using ranking-based evaluation methods.
https://ieeexplore.ieee.org/document/9481947/
Keywords: Vocal-accompaniment compatibility; metric learning; music signal processing; music information retrieval
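The self-supervised dataset construction described above (separate each song into a vocal and an accompaniment track, treat the original pairing as compatible and random cross-song pairings as incompatible) can be sketched as follows. The separated tracks are represented here by placeholder strings; in practice they would come from a singing-voice-separation model, which this sketch does not include:

```python
import random

def make_training_pairs(separated_songs, n_negatives=1, seed=0):
    """Build (vocal, accompaniment, label) triples from separated songs.

    separated_songs: list of (vocal_track, accompaniment_track) tuples,
    one per song, as produced by a singing-voice-separation step.
    Label 1 = compatible (original pairing), 0 = incompatible (random).
    """
    rng = random.Random(seed)
    pairs = []
    n = len(separated_songs)
    for i, (vocal, _) in enumerate(separated_songs):
        # The original vocal/accompaniment pairing is assumed compatible.
        pairs.append((vocal, separated_songs[i][1], 1))
        # Accompaniments drawn from other songs are assumed incompatible.
        for _ in range(n_negatives):
            j = rng.choice([k for k in range(n) if k != i])
            pairs.append((vocal, separated_songs[j][1], 0))
    return pairs

# Toy corpus of 5 "songs"; real inputs would be separated audio tracks.
songs = [(f"vocal_{i}", f"accomp_{i}") for i in range(5)]
pairs = make_training_pairs(songs, n_negatives=2)
```

The key assumption, stated in the abstract, is that no human labeling is needed: positives and negatives fall out of the separation step for free, which is what makes training on 910,803 songs feasible.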
collection DOAJ
language English
format Article
sources DOAJ
author Takayuki Nakatsuka
Kento Watanabe
Yuki Koyama
Masahiro Hamasaki
Masataka Goto
Shigeo Morishima
spellingShingle Takayuki Nakatsuka
Kento Watanabe
Yuki Koyama
Masahiro Hamasaki
Masataka Goto
Shigeo Morishima
Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
IEEE Access
Vocal-accompaniment compatibility
metric learning
music signal processing
music information retrieval
author_facet Takayuki Nakatsuka
Kento Watanabe
Yuki Koyama
Masahiro Hamasaki
Masataka Goto
Shigeo Morishima
author_sort Takayuki Nakatsuka
title Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
title_short Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
title_full Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
title_fullStr Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
title_full_unstemmed Vocal-Accompaniment Compatibility Estimation Using Self-Supervised and Joint-Embedding Techniques
title_sort vocal-accompaniment compatibility estimation using self-supervised and joint-embedding techniques
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
description We propose a learning-based method of estimating the compatibility between vocal and accompaniment audio tracks, <italic>i.e.</italic>, how well they go with each other when played simultaneously. This task is challenging because it is difficult to formulate hand-crafted rules or construct a large labeled dataset to perform supervised learning. Our method uses self-supervised and joint-embedding techniques for estimating vocal-accompaniment compatibility. We train vocal and accompaniment encoders to learn a joint-embedding space of vocal and accompaniment tracks, where the embedded feature vectors of a compatible pair of vocal and accompaniment tracks lie close to each other and those of an incompatible pair lie far from each other. To address the lack of large labeled datasets consisting of compatible and incompatible pairs of vocal and accompaniment tracks, we propose generating such a dataset from songs using singing voice separation techniques, with which songs are separated into pairs of vocal and accompaniment tracks, and then original pairs are assumed to be compatible, and other random pairs are not. We achieved this training by constructing a large dataset containing 910,803 songs and evaluated the effectiveness of our method using ranking-based evaluation methods.
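The "ranking-based evaluation" mentioned at the end of the description can be read as a retrieval test: for each vocal query, rank all candidate accompaniments by embedding similarity and check where the original accompaniment lands. The sketch below computes common ranking metrics (mean rank, recall@K) from a similarity matrix; the specific metrics and protocol are illustrative assumptions, not necessarily the paper's exact choices:

```python
import numpy as np

def ranking_metrics(similarity):
    """similarity[i, j] = compatibility score of vocal i with accompaniment j;
    the ground-truth match for vocal i is accompaniment i."""
    n = similarity.shape[0]
    # Candidates sorted best-first for each vocal query.
    order = np.argsort(-similarity, axis=1)
    # Rank of the true accompaniment among all candidates (1 = best).
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])
    return {
        "mean_rank": float(ranks.mean()),
        "recall@1": float((ranks == 1).mean()),
        "recall@5": float((ranks <= 5).mean()),
    }

# Sanity check with perfect retrieval: identity similarity matrix,
# so every vocal's own accompaniment scores highest.
metrics = ranking_metrics(np.eye(8))
```

A well-trained joint embedding should push the mean rank toward 1 and the recall@K values toward 1.0 on held-out songs.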
topic Vocal-accompaniment compatibility
metric learning
music signal processing
music information retrieval
url https://ieeexplore.ieee.org/document/9481947/
work_keys_str_mv AT takayukinakatsuka vocalaccompanimentcompatibilityestimationusingselfsupervisedandjointembeddingtechniques
AT kentowatanabe vocalaccompanimentcompatibilityestimationusingselfsupervisedandjointembeddingtechniques
AT yukikoyama vocalaccompanimentcompatibilityestimationusingselfsupervisedandjointembeddingtechniques
AT masahirohamasaki vocalaccompanimentcompatibilityestimationusingselfsupervisedandjointembeddingtechniques
AT masatakagoto vocalaccompanimentcompatibilityestimationusingselfsupervisedandjointembeddingtechniques
AT shigeomorishima vocalaccompanimentcompatibilityestimationusingselfsupervisedandjointembeddingtechniques
_version_ 1721280388520738816