Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder

In this paper, we present a novel framework for a voice conversion (VC) system based on a cyclic recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder. Even though WaveNet is capable of producing natural speech waveforms when fed with natural speech features, it still suffers from s...


Bibliographic Details
Main Authors:	Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Cyclic mapping flow; oversmoothed spectral features; recurrent neural network; spectral mapping; voice conversion; WaveNet fine-tuning
Online Access:	https://ieeexplore.ieee.org/document/8913551/

id	doaj-e7bfa6675bd5419cb69615d083e9b1db
record_format	Article
spelling	doaj-e7bfa6675bd5419cb69615d083e9b1db | 2021-03-30T00:50:08Z | eng | IEEE | IEEE Access | 2169-3536 | 2019-01-01 | 7 | 171114 | 171125 | 10.1109/ACCESS.2019.2955978 | 8913551 | Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder | Patrick Lumban Tobing | 0 | https://orcid.org/0000-0003-2792-8418 | Yi-Chiao Wu | 1 | Tomoki Hayashi | 2 | Kazuhiro Kobayashi | 3 | Tomoki Toda | 4 | Graduate School of Information Science, Nagoya University, Nagoya, Japan | Graduate School of Information Science, Nagoya University, Nagoya, Japan | Graduate School of Information Science, Nagoya University, Nagoya, Japan | Information Technology Center, Nagoya University, Nagoya, Japan | Information Technology Center, Nagoya University, Nagoya, Japan | In this paper, we present a novel framework for a voice conversion (VC) system based on a cyclic recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder. Even though WaveNet is capable of producing natural speech waveforms when fed with natural speech features, it still suffers from speech quality degradation when fed with oversmoothed features, such as spectral parameters estimated from a statistical model. One way to address this problem is to introduce oversmoothed features while developing a WaveNet model. However, in a VC framework, providing oversmoothed spectral features of a target speaker for WaveNet modeling is not straightforward owing to the difference in the time-sequence alignment from that of a source speaker. To overcome this problem, we propose the use of a cyclic spectral conversion network, i.e., CycleRNN, capable of performing a conversion flow, i.e., source-to-target, and a cyclic flow, i.e., to generate self-predicted target spectra. The CycleRNN spectral model is trained using both conversion and weighted cyclic losses. To finely tune WaveNet, a pretrained multispeaker WaveNet model is optimized using the self-predicted features of the corresponding target speaker of a speaker conversion pair. The experimental results demonstrate that 1) the proposed CycleRNN-based spectral model for WaveNet fine-tuning significantly improves the naturalness of the converted speech waveforms, giving an overall mean opinion score of 3.50; and 2) the proposed model yields the highest speaker conversion accuracy, giving an overall speaker similarity score of 78.33%, which is a significant improvement compared with conventional WaveNet fine-tuning using natural target features. | https://ieeexplore.ieee.org/document/8913551/ | Cyclic mapping flow | oversmoothed spectral features | recurrent neural network | spectral mapping | voice conversion | WaveNet fine-tuning
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda
spellingShingle	Patrick Lumban Tobing Yi-Chiao Wu Tomoki Hayashi Kazuhiro Kobayashi Tomoki Toda Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder IEEE Access Cyclic mapping flow oversmoothed spectral features recurrent neural network spectral mapping voice conversion WaveNet fine-tuning
author_facet	Patrick Lumban Tobing, Yi-Chiao Wu, Tomoki Hayashi, Kazuhiro Kobayashi, Tomoki Toda
author_sort	Patrick Lumban Tobing
title	Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder
title_short	Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder
title_full	Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder
title_fullStr	Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder
title_full_unstemmed	Voice Conversion With CycleRNN-Based Spectral Mapping and Finely Tuned WaveNet Vocoder
title_sort	voice conversion with cyclernn-based spectral mapping and finely tuned wavenet vocoder
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	In this paper, we present a novel framework for a voice conversion (VC) system based on a cyclic recurrent neural network (CycleRNN) and a finely tuned WaveNet vocoder. Even though WaveNet is capable of producing natural speech waveforms when fed with natural speech features, it still suffers from speech quality degradation when fed with oversmoothed features, such as spectral parameters estimated from a statistical model. One way to address this problem is to introduce oversmoothed features while developing a WaveNet model. However, in a VC framework, providing oversmoothed spectral features of a target speaker for WaveNet modeling is not straightforward owing to the difference in the time-sequence alignment from that of a source speaker. To overcome this problem, we propose the use of a cyclic spectral conversion network, i.e., CycleRNN, capable of performing a conversion flow, i.e., source-to-target, and a cyclic flow, i.e., to generate self-predicted target spectra. The CycleRNN spectral model is trained using both conversion and weighted cyclic losses. To finely tune WaveNet, a pretrained multispeaker WaveNet model is optimized using the self-predicted features of the corresponding target speaker of a speaker conversion pair. The experimental results demonstrate that 1) the proposed CycleRNN-based spectral model for WaveNet fine-tuning significantly improves the naturalness of the converted speech waveforms, giving an overall mean opinion score of 3.50; and 2) the proposed model yields the highest speaker conversion accuracy, giving an overall speaker similarity score of 78.33%, which is a significant improvement compared with conventional WaveNet fine-tuning using natural target features.
topic	Cyclic mapping flow; oversmoothed spectral features; recurrent neural network; spectral mapping; voice conversion; WaveNet fine-tuning
url	https://ieeexplore.ieee.org/document/8913551/
work_keys_str_mv	AT patricklumbantobing voiceconversionwithcyclernnbasedspectralmappingandfinelytunedwavenetvocoder AT yichiaowu voiceconversionwithcyclernnbasedspectralmappingandfinelytunedwavenetvocoder AT tomokihayashi voiceconversionwithcyclernnbasedspectralmappingandfinelytunedwavenetvocoder AT kazuhirokobayashi voiceconversionwithcyclernnbasedspectralmappingandfinelytunedwavenetvocoder AT tomokitoda voiceconversionwithcyclernnbasedspectralmappingandfinelytunedwavenetvocoder
_version_	1724187863239622656
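
The description field above states that the CycleRNN spectral model is trained using both conversion and weighted cyclic losses, and that the cyclic flow yields self-predicted target spectra. The record gives no formula, so the following Python sketch is only one plausible reading and not the authors' implementation: the mean-absolute-error distance, the target-to-source-to-target cycle, the default weight of 0.5, and the names cycle_training_loss, src_to_tgt, and tgt_to_src are all assumptions made for illustration. The inputs are assumed to be already time-aligned.

```python
import numpy as np


def mean_abs_error(a, b):
    """Mean absolute difference between two (frames, dims) feature arrays."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))


def cycle_training_loss(src, tgt, src_to_tgt, tgt_to_src, cyclic_weight=0.5):
    """Return (total loss, self-predicted target features) for one utterance pair.

    src, tgt       : time-aligned (frames, dims) spectral features (assumed).
    src_to_tgt     : hypothetical callable standing in for the conversion model.
    tgt_to_src     : hypothetical callable standing in for the reverse mapping.
    cyclic_weight  : assumed scalar weight on the cyclic loss.
    """
    # Conversion flow: source features mapped toward the target speaker.
    converted = src_to_tgt(src)
    # Cyclic flow (assumed direction): target -> source -> target.
    self_predicted = src_to_tgt(tgt_to_src(tgt))

    conversion_loss = mean_abs_error(converted, tgt)
    cyclic_loss = mean_abs_error(self_predicted, tgt)
    total = conversion_loss + cyclic_weight * cyclic_loss
    return total, self_predicted
```

In this reading, the self-predicted features keep the time alignment of the natural target utterance, which is what would allow them to be paired with the target waveform when fine-tuning the vocoder.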
