Gated Recurrent Attention for Multi-Style Speech Synthesis


Bibliographic Details
Main Authors: Sung Jun Cheon, Joun Yeop Lee, Byoung Jin Choi, Hyeonseung Lee, Nam Soo Kim
Format: Article
Language: English
Published: MDPI AG 2020-07-01
Series: Applied Sciences
Online Access:https://www.mdpi.com/2076-3417/10/15/5325
Description
Summary: End-to-end neural network-based speech synthesis techniques have been developed to represent and synthesize speech in various prosodic styles. Although end-to-end techniques enable the transfer of a style with a single style-representation vector, the speaker similarity of speech synthesized with an unseen speaker style has been reported to be low. One reason for this problem is that the attention mechanism in the end-to-end model overfits to the training data. To learn and synthesize voices of various styles, an attention mechanism that can preserve longer-term context and control that context is required. In this paper, we propose a novel attention model which employs gates to control the recurrences in the attention. To verify the proposed attention's style modeling capability, perceptual listening tests were conducted. The experiments show that the proposed attention outperforms location-sensitive attention in both similarity and naturalness.
ISSN:2076-3417
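The abstract describes gating the recurrence inside the attention mechanism so that the previous alignment re-enters the energy computation only to the degree a learned gate allows. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch under assumed definitions: a sigmoid gate `g`, computed from the decoder query, scales the previous alignment before it contributes to an additive (Bahdanau-style) energy; all weight names (`W_q`, `W_k`, `w_a`, `v`, `w_g`, `b_g`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_recurrent_attention_step(query, keys, prev_align, params):
    """One decoder step of a gated recurrent attention sketch (not the
    paper's exact equations). The previous alignment is scaled by a
    learned sigmoid gate before re-entering the energy computation, so
    the model controls how strongly past attention context recurs."""
    W_q, W_k, w_a, v, w_g, b_g = params
    # scalar gate in (0, 1), computed from the decoder query
    g = 1.0 / (1.0 + np.exp(-(w_g @ query + b_g)))
    # additive energies: query term + key term + gated recurrence term
    energies = np.array([
        v @ np.tanh(W_q @ query + W_k @ k + w_a * (g * a))
        for k, a in zip(keys, prev_align)
    ])
    align = softmax(energies)          # normalized alignment over encoder steps
    context = align @ keys             # attention context vector
    return align, context, g

d, T = 8, 5                            # feature dim, number of encoder steps
params = (rng.standard_normal((d, d)),  # W_q: query projection
          rng.standard_normal((d, d)),  # W_k: key projection
          rng.standard_normal(d),       # w_a: recurrence projection
          rng.standard_normal(d),       # v:   energy vector
          rng.standard_normal(d),       # w_g: gate weights
          0.0)                          # b_g: gate bias
keys = rng.standard_normal((T, d))
query = rng.standard_normal(d)
prev_align = np.full(T, 1.0 / T)        # uniform alignment at the first step

align, context, g = gated_recurrent_attention_step(query, keys, prev_align, params)
```

With `g` near zero the energies reduce to plain content-based attention; with `g` near one the previous alignment fully recurs, which is the kind of controllable longer-term context the abstract motivates.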