A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
Abstract: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and calculates attention scores over multi-level outputs. Experiments on TIMIT show that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% (absolute) worse than the best referenced method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
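The hybrid objective the abstract describes is typically an interpolation of the CTC loss and the attention decoder's loss. A minimal sketch of that combination, assuming scalar per-batch losses; the weight name `lmbda` and the toy loss values are illustrative assumptions, not taken from the paper:

```python
def joint_ctc_attention_loss(ctc_loss: float, attention_loss: float,
                             lmbda: float = 0.3) -> float:
    """Interpolate the CTC and attention losses with weight lmbda in [0, 1]."""
    if not 0.0 <= lmbda <= 1.0:
        raise ValueError("lmbda must lie in [0, 1]")
    return lmbda * ctc_loss + (1.0 - lmbda) * attention_loss

# Toy per-batch loss values, purely for illustration.
total = joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0, lmbda=0.3)
print(total)  # 0.3 * 2.0 + 0.7 * 1.0, i.e. approximately 1.3
```

During training the two losses are computed from shared encoder outputs and back-propagated jointly; the CTC branch acts as a regularizer that encourages monotonic alignments.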
Main Authors: Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu
Format: Article
Language: English
Published: SpringerOpen, 2019-10-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
Subjects: Speech recognition; End-to-end; Attention mechanism
Online Access: http://link.springer.com/article/10.1186/s13636-019-0161-0
id: doaj-17464fc86c904a17a9114d9a891854f5
record_format: Article
doi: 10.1186/s13636-019-0161-0
affiliations: National Digital Switching System Engineering and Technological R&D Center (Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu)
collection: DOAJ
language: English
format: Article
sources: DOAJ
author: Chu-Xiong Qin; Wen-Lin Zhang; Dan Qu
author_sort: Chu-Xiong Qin
title: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
publisher: SpringerOpen
series: EURASIP Journal on Audio, Speech, and Music Processing
issn: 1687-4722
publishDate: 2019-10-01
description: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and calculates attention scores over multi-level outputs. Experiments on TIMIT show that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% (absolute) worse than the best referenced method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
topic: Speech recognition; End-to-end; Attention mechanism
url: http://link.springer.com/article/10.1186/s13636-019-0161-0
_version_: 1724658250480091136