Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition
As demonstrated in the hybrid connectionist temporal classification (CTC)/Attention architecture, joint training with a CTC objective is very effective at solving the misalignment problem in the attention-based end-to-end automatic speech recognition (ASR) framework. However, the CTC output relies only on the current input, which leads to a hard alignment issue. To address this problem, this paper proposes the time-restricted attention CTC/Attention architecture, which integrates an attention mechanism into the CTC branch. “Time-restricted” means that the attention mechanism is computed over a limited window of frames to the left and right of the current frame. In this study, we first explore the time-restricted location-aware attention CTC/Attention to establish a proper attention window size. Inspired by the success of self-attention in machine translation, we then introduce the time-restricted self-attention CTC/Attention, which better models long-range dependencies among frames. Experiments on the Wall Street Journal (WSJ), Augmented Multi-party Interaction (AMI), and Switchboard (SWBD) tasks demonstrate the effectiveness of the proposed time-restricted self-attention CTC/Attention. Finally, to explore the robustness of this method to noise and reverberation, we jointly train a neural beamformer frontend with the time-restricted attention CTC/Attention ASR backend on the CHiME-4 dataset. The reduction in word error rate (WER) and the increase in perceptual evaluation of speech quality (PESQ) scores confirm the effectiveness of this framework.
Main Authors: | Long Wu, Ta Li, Li Wang, Yonghong Yan |
---|---|
Affiliation: | Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2019-10-01 |
Series: | Applied Sciences |
ISSN: | 2076-3417 |
DOI: | 10.3390/app9214639 |
Subjects: | automatic speech recognition; end-to-end; CTC; self-attention; hybrid CTC/Attention |
Online Access: | https://www.mdpi.com/2076-3417/9/21/4639 |
DOAJ record: | doaj-18b538b09c53488cb25b993ec03278b7 |
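
To make the “limited window of frames” idea in the abstract concrete, the sketch below implements a generic windowed (banded) self-attention layer over encoder frames in PyTorch. It is a minimal illustration under assumed names and shapes (the function `time_restricted_self_attention`, a single attention head, and a ±7-frame window are choices made here), not the authors' published implementation.

```python
# Minimal sketch of time-restricted (windowed) self-attention over encoder frames.
# Names, shapes, and the window size are illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def time_restricted_self_attention(x, w_q, w_k, w_v, window=7):
    """x: (T, d) sequence of encoder frames; w_q, w_k, w_v: (d, d) projections.
    Each frame attends only to frames within `window` steps to its left and right."""
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v               # (T, d) each
    scores = (q @ k.transpose(0, 1)) / d ** 0.5       # (T, T) full attention scores
    # Banded mask: position t may attend to positions t-window .. t+window only.
    idx = torch.arange(T)
    band = (idx[:, None] - idx[None, :]).abs() <= window
    scores = scores.masked_fill(~band, float("-inf"))
    weights = F.softmax(scores, dim=-1)               # rows sum to 1 inside the band
    return weights @ v                                # (T, d) context per frame

# Toy usage: random frames and projections, just to show the shapes involved.
torch.manual_seed(0)
T, d = 50, 16
frames = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
context = time_restricted_self_attention(frames, w_q, w_k, w_v, window=7)
print(context.shape)  # torch.Size([50, 16])
```

In the hybrid architecture described in the abstract, the context vectors produced by such a layer would feed the CTC branch, so that each CTC prediction depends on a neighborhood of frames rather than on the current frame alone.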