Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention

Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering int...

Bibliographic Details
Main Authors:	Hailun Lian, Yuting Hu, Weiwei Yu, Jian Zhou, Wenming Zheng
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Auditory attention mechanism; sequence-to-sequence; speech quality; whisper conversion
Online Access:	https://ieeexplore.ieee.org/document/8835014/

id	doaj-4712a90c9f1443f589fb1710e6ad562d
record_format	Article
spelling	doaj-4712a90c9f1443f589fb1710e6ad562d; 2021-04-05T17:32:33Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2019-01-01; vol. 7, pp. 130495-130504; DOI 10.1109/ACCESS.2019.2940700; IEEE Xplore document 8835014
	Title: Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention
	Authors: Hailun Lian (0); Yuting Hu (1); Weiwei Yu (2); Jian Zhou (3), https://orcid.org/0000-0001-6509-5520; Wenming Zheng (4), https://orcid.org/0000-0002-7764-5179
	Affiliations: (0)-(3) Key Laboratory of Intelligent Computing and Signal Processing, Ministry of Education, Anhui University, Hefei, China; (4) Key Laboratory of Child Development and Learning Science, Ministry of Education, Southeast University, Nanjing, China
	URL: https://ieeexplore.ieee.org/document/8835014/
	Keywords: Auditory attention mechanism; sequence-to-sequence; speech quality; whisper conversion
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Hailun Lian, Yuting Hu, Weiwei Yu, Jian Zhou, Wenming Zheng
title	Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model With Auditory Attention
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	Whispering is a special pronunciation style in which the vocal cords do not vibrate. Compared with voiced speech, whispering is noise-like because of the lack of a fundamental frequency. The energy of whispered speech is approximately 20 dB lower than that of voiced speech. Converting whispering into normal speech is an effective way to improve speech quality and/or intelligibility. In this paper, we propose a whisper-to-normal speech conversion method based on a sequence-to-sequence framework combined with an auditory attention mechanism. The proposed method does not require time aligning before conversion training, which makes it more applicable to real scenarios. In addition, the fundamental frequency is estimated from the mel frequency cepstral coefficients estimated by the proposed sequence-to-sequence framework. The voiced speech converted by the proposed method has appropriate length, which is determined adaptively by the proposed sequence-to-sequence model according to the source whispered speech. Experimental results show that the proposed sequence-to-sequence whisper-to-normal speech conversion method outperforms conventional DTW-based methods.
topic	Auditory attention mechanism; sequence-to-sequence; speech quality; whisper conversion
url	https://ieeexplore.ieee.org/document/8835014/

Similar Items