Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention

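Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time aligning before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel frequency cepstral coefficients estimated by the proposed sequence-to-sequence framework. The voiced speech converted by the proposed method has appropriate length, which is determined adaptively by the proposed sequence-to-sequence model according to the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional DTW-based methods.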
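For context, the conventional DTW-based methods named in the abstract rely on dynamic time warping to time-align whispered and normal feature sequences before a conversion model is trained, which is the step the proposed method avoids. The sketch below is a generic textbook DTW and not the baseline used in the paper; the function name `dtw_align`, the Euclidean frame distance, and the random 13-dimensional toy features are all illustrative assumptions.

```python
import numpy as np


def dtw_align(source, target):
    """Align two feature sequences (frames x dims) with classic DTW.

    Returns the warping path as a list of (source_index, target_index) pairs.
    Hypothetical helper for illustration only.
    """
    n, m = len(source), len(target)
    # Pairwise Euclidean distance between every source and target frame.
    dist = np.linalg.norm(source[:, None, :] - target[None, :, :], axis=-1)

    # cost[i, j] = minimum accumulated distance aligning source[:i] with target[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1],  # both sequences advance
                cost[i - 1, j],      # only the source advances
                cost[i, j - 1],      # only the target advances
            )

    # Backtrack from the last frame pair to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        move = int(np.argmin(steps))
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path


# Toy data (assumed): the "normal" sequence is the "whisper" sequence
# with every frame repeated twice, so the two differ only in timing.
rng = np.random.default_rng(0)
whisper_feats = rng.normal(size=(6, 13))
normal_feats = np.repeat(whisper_feats, 2, axis=0)
path = dtw_align(whisper_feats, normal_feats)
```

On this toy input each whispered frame is paired with its two repeated copies. A pipeline of this kind needs such an explicit alignment pass over every training pair, whereas a sequence-to-sequence model with attention learns the correspondence during training and chooses the output length itself.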


Bibliographic Details
Main Authors: Hailun Lian, Yuting Hu, Weiwei Yu, Jian Zhou, Wenming Zheng
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Auditory attention mechanism; sequence-to-sequence; speech quality; whisper conversion
Online Access: https://ieeexplore.ieee.org/document/8835014/
id doaj-4712a90c9f1443f589fb1710e6ad562d
record_format Article
spelling doaj-4712a90c9f1443f589fb1710e6ad562d
timestamp 2021-04-05T17:32:33Z
language eng
publisher IEEE
journal IEEE Access
issn 2169-3536
date 2019-01-01
volume 7
pages 130495-130504
doi 10.1109/ACCESS.2019.2940700
document number 8835014
authors and affiliations
Hailun Lian (0): Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China
Yuting Hu (1): Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China
Weiwei Yu (2): Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China
Jian Zhou (3), https://orcid.org/0000-0001-6509-5520: Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China
Wenming Zheng (4), https://orcid.org/0000-0002-7764-5179: Key Laboratory of Child Development and Learning Science, Ministry of Education, Southeast University, Nanjing, China
collection DOAJ
language English
format Article
sources DOAJ
author Hailun Lian
Yuting Hu
Weiwei Yu
Jian Zhou
Wenming Zheng
title Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time aligning before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel frequency cepstral coefficients estimated by the proposed sequence-to-sequence framework. The voiced speech converted by the proposed method has appropriate length, which is determined adaptively by the proposed sequence-to-sequence model according to the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional DTW-based methods.
topic Auditory attention mechanism
sequence-to-sequence
speech quality
whisper conversion
url https://ieeexplore.ieee.org/document/8835014/