Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention
Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering int...
Main Authors: | Hailun Lian, Yuting Hu, Weiwei Yu, Jian Zhou, Wenming Zheng |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2019-01-01 |
Series: | IEEE Access |
Subjects: | Auditory attention mechanism; sequence-to-sequence; speech quality; whisper conversion |
Online Access: | https://ieeexplore.ieee.org/document/8835014/ |
id |
doaj-4712a90c9f1443f589fb1710e6ad562d |
---|---|
record_format |
Article |
spelling |
doaj-4712a90c9f1443f589fb1710e6ad562d; 2021-04-05T17:32:33Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 130495; 130504; 10.1109/ACCESS.2019.2940700; 8835014; Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention; Hailun Lian (0); Yuting Hu (1); Weiwei Yu (2); Jian Zhou (3), https://orcid.org/0000-0001-6509-5520; Wenming Zheng (4), https://orcid.org/0000-0002-7764-5179; (0) Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China; (1) Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China; (2) Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China; (3) Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China; (4) Key Laboratory of Child Development and Learning Science, Ministry of Education, Southeast University, Nanjing, China; Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time aligning before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel frequency cepstral coefficients estimated by the proposed sequence-to-sequence framework. The voiced speech converted by the proposed method has appropriate length, which is determined adaptively by the proposed sequence-to-sequence model according to the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional DTW-based methods.; https://ieeexplore.ieee.org/document/8835014/; Auditory attention mechanism; sequence-to-sequence; speech quality; whisper conversion |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Hailun Lian Yuting Hu Weiwei Yu Jian Zhou Wenming Zheng |
author_facet |
Hailun Lian Yuting Hu Weiwei Yu Jian Zhou Wenming Zheng |
author_sort |
Hailun Lian |
title |
Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention |
title_sort |
whisper to normal speech conversion using sequence-to-sequence mapping model with auditory attention |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2019-01-01 |
description |
Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time aligning before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel frequency cepstral coefficients estimated by the proposed sequence-to-sequence framework. The voiced speech converted by the proposed method has appropriate length, which is determined adaptively by the proposed sequence-to-sequence model according to the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional DTW-based methods. |
topic |
Auditory attention mechanism sequence-to-sequence speech quality whisper conversion |
url |
https://ieeexplore.ieee.org/document/8835014/ |
_version_ |
1721539404645793792 |