Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder

WaveNet, which learns directly from speech waveform samples, has been used as an alternative to vocoders and achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity even in multi-speaker text-to-speech synthesis systems. However, the WaveNet vocoder uses acous...


Bibliographic Details
Main Authors: Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, Nobuaki Minematsu
Format: Article
Language: English
Published: IEEE 2018-01-01
Series: IEEE Access
Subjects: Generative adversarial network; multi-speaker modeling; speech synthesis; WaveNet
Online Access: https://ieeexplore.ieee.org/document/8471179/
id doaj-940663e0a8f74dbfbb1aa673ff64b3a5
record_format Article
spelling doaj-940663e0a8f74dbfbb1aa673ff64b3a5
2021-03-29T21:32:37Z
eng
IEEE
IEEE Access
2169-3536
2018-01-01
Volume 6, pages 60478-60488
DOI 10.1109/ACCESS.2018.2872060
IEEE document 8471179
Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
Yi Zhao (https://orcid.org/0000-0002-3555-9408), Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Shinji Takaki, Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan
Hieu-Thi Luong, Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan
Junichi Yamagishi, Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan
Daisuke Saito, Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
Nobuaki Minematsu, Department of Electrical Engineering and Information Systems, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan
collection DOAJ
language English
format Article
sources DOAJ
author Yi Zhao
Shinji Takaki
Hieu-Thi Luong
Junichi Yamagishi
Daisuke Saito
Nobuaki Minematsu
title Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2018-01-01
description WaveNet, which learns directly from speech waveform samples, has been used as an alternative to vocoders and has achieved very high-quality synthetic speech in terms of both naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. However, the WaveNet vocoder uses acoustic features as local condition parameters, and these parameters need to be accurately predicted by another acoustic model. It is not yet clear how to train this acoustic model, which is problematic because the final quality of synthetic speech is significantly affected by the performance of the acoustic model. Significant degradation occurs especially when the predicted acoustic features have characteristics that are mismatched with those of natural ones. To reduce this mismatch between natural and generated acoustic features, we propose new frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses the WaveNet vocoder. The GAN generator serves as the acoustic model, and its outputs are used as the local condition parameters of the WaveNet. We also extend the GAN frameworks and use the discretized-mixture-of-logistics (DML) loss of a well-trained WaveNet, in addition to mean squared error and adversarial losses, as parts of the objective functions. Experimental results show that acoustic models trained with the WGAN-GP framework and back-propagated DML loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
topic Generative adversarial network
multi-speaker modeling
speech synthesis
WaveNet
url https://ieeexplore.ieee.org/document/8471179/