TTS-Guided Training for Accent Conversion Without Parallel Data

Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many existing AC approaches rely on source-target parallel speech data during training or reference speech at run-time. We propose a novel accent conversion framework without the need for either parallel data or reference speech. Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data. This TTS model and its hidden representations are expected to be associated only with the target accent. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. In doing so, the source-accented speech and its corresponding transcription are forwarded to the speech encoder and the pretrained TTS, respectively. The output of the speech encoder is optimized to be the same as the text embedding in the TTS system. At run-time, the speech encoder is combined with the pretrained speech decoder to convert the source-accented speech toward the target. In the experiments, we converted English with two source accents (Chinese/Indian) to the target accent (American/British/Canadian). Both objective metrics and subjective listening tests validate that the proposed approach generates speech samples that are close to the target accent with high speech quality.

Bibliographic Details
Main Authors: Li, H. (Author), Tian, X. (Author), Wu, Z. (Author), Zhang, M. (Author), Zhou, Y. (Author)
Format: Article
Language: English
Published: Institute of Electrical and Electronics Engineers Inc. 2023
Online Access: View Fulltext in Publisher
View in Scopus
LEADER 02900nam a2200457Ia 4500
001 10.1109-LSP.2023.3270079
008 230529s2023 CNT 000 0 und d
020 |a 10709908 (ISSN) 
245 1 0 |a TTS-Guided Training for Accent Conversion Without Parallel Data 
260 0 |b Institute of Electrical and Electronics Engineers Inc.  |c 2023 
300 |a 5 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1109/LSP.2023.3270079 
856 |z View in Scopus  |u https://www.scopus.com/inward/record.uri?eid=2-s2.0-85159721330&doi=10.1109%2fLSP.2023.3270079&partnerID=40&md5=0fdcacbf61b2ee160f836c353cd2b5b0 
520 3 |a Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many existing AC approaches rely on source-target parallel speech data during training or reference speech at run-time. We propose a novel accent conversion framework without the need for either parallel data or reference speech. Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data. This TTS model and its hidden representations are expected to be associated only with the target accent. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. In doing so, the source-accented speech and its corresponding transcription are forwarded to the speech encoder and the pretrained TTS, respectively. The output of the speech encoder is optimized to be the same as the text embedding in the TTS system. At run-time, the speech encoder is combined with the pretrained speech decoder to convert the source-accented speech toward the target. In the experiments, we converted English with two source accents (Chinese/Indian) to the target accent (American/British/Canadian). Both objective metrics and subjective listening tests validate that the proposed approach generates speech samples that are close to the target accent with high speech quality. 
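The abstract's core training idea — pushing a speech encoder's output toward the frozen TTS text embedding of the same utterance — can be sketched in a few lines. This is an illustrative toy only: the dimensions, the single linear-layer "encoder", the random stand-in for the TTS text embedding, and the mean-squared-error objective are all assumptions, not details taken from the paper.

```python
import numpy as np

# Toy sketch of TTS-guided encoder training (all shapes/models are
# illustrative assumptions, not the paper's actual architecture).
rng = np.random.default_rng(0)
feat_dim, emb_dim, frames = 80, 256, 120

# Frozen stand-in for the pretrained TTS text embedding of one utterance.
tts_text_embedding = rng.normal(size=(frames, emb_dim))

# Source-accented speech features and a trainable linear "speech encoder".
speech_features = rng.normal(size=(frames, feat_dim))
W = rng.normal(scale=0.01, size=(feat_dim, emb_dim))

def mse(a, b):
    return float(np.mean((a - b) ** 2))

lr = 0.01
losses = []
for step in range(200):
    pred = speech_features @ W              # encoder forward pass
    err = pred - tts_text_embedding         # mismatch to the TTS embedding
    # Gradient of the MSE loss w.r.t. W, then one gradient-descent step.
    grad = speech_features.T @ err * (2.0 / (frames * emb_dim))
    W -= lr * grad
    losses.append(mse(pred, tts_text_embedding))
```

As the loss falls, the encoder maps source-accented features into the TTS model's text-embedding space; at run-time that output would be handed to the pretrained (target-accent) TTS decoder, which is how the abstract describes achieving conversion without parallel data.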
650 0 4 |a Accent conversion 
650 0 4 |a accent conversion (AC) 
650 0 4 |a Accented speech 
650 0 4 |a Acoustics 
650 0 4 |a Data handling 
650 0 4 |a Data models 
650 0 4 |a Decoding 
650 0 4 |a Error analysis 
650 0 4 |a Feature extraction 
650 0 4 |a Features extraction 
650 0 4 |a Parallel data 
650 0 4 |a Phonetics 
650 0 4 |a Runtimes 
650 0 4 |a Signal encoding 
650 0 4 |a Speech data 
650 0 4 |a Speech recognition 
650 0 4 |a Text to speech 
650 0 4 |a Text-to-speech 
650 0 4 |a text-to-speech (TTS) 
650 0 4 |a Text-to-speech system 
650 0 4 |a Training 
700 1 0 |a Li, H.  |e author 
700 1 0 |a Tian, X.  |e author 
700 1 0 |a Wu, Z.  |e author 
700 1 0 |a Zhang, M.  |e author 
700 1 0 |a Zhou, Y.  |e author 
773 |t IEEE Signal Processing Letters