A new joint CTC-attention-based speech recognition model with multi-level multi-head attention

Abstract A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. This hybrid end-to-end architecture adds an auxiliary CTC loss to the attention-based model, imposing additional restrictions on the alignments. To better exploit end-to-end models, we propose improvements to both the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Second, we put forward a hybrid attention mechanism that incorporates multi-head attention and computes attention scores over multi-level outputs. Experiments on TIMIT show that our best model achieves state-of-the-art performance. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% (absolute) worse than the best referenced method, which is trained on a much larger dataset, while outperforming all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
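In such joint models, the training objective is typically an interpolation of the two losses, L = λ·L_CTC + (1 − λ)·L_attention. The multi-head scoring over multi-level encoder outputs described in the abstract can be sketched roughly as below; this is a minimal numpy illustration under simplifying assumptions (random matrices stand in for learned projections, and "multi-level" is approximated by concatenating two encoder layers' outputs along time), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: weights along `axis` sum to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(query, keys, num_heads=4, seed=0):
    """Scaled dot-product attention with several heads.

    query: (d_model,) decoder state; keys: (T, d_model) encoder outputs.
    The projection matrices are random placeholders for learned parameters.
    """
    d_model = keys.shape[1]
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    contexts = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
        q = query @ Wq                             # (d_head,)
        k = keys @ Wk                              # (T, d_head)
        scores = softmax(k @ q / np.sqrt(d_head))  # (T,), sums to 1
        contexts.append(scores @ keys)             # weighted sum over time
    return np.concatenate(contexts)                # (num_heads * d_model,)

# "Multi-level" attention: score jointly over lower- and top-layer encoder
# outputs, here simply concatenated along the time axis (a simplification).
low = np.random.default_rng(1).standard_normal((50, 8))    # lower-layer outputs
high = np.random.default_rng(2).standard_normal((50, 8))   # top-layer outputs
multi_level = np.concatenate([low, high], axis=0)          # (100, 8)
state = np.zeros(8)                                        # dummy decoder state
ctx = multi_head_attention(state, multi_level, num_heads=4)
print(ctx.shape)  # (32,)
```

Each head attends over both representation levels at once, so low-level acoustic detail and high-level context can both contribute to the context vector fed to the decoder.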


Bibliographic Details
Main Authors: Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu
Format: Article
Language: English
Published: SpringerOpen, 2019-10-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
ISSN: 1687-4722
DOI: 10.1186/s13636-019-0161-0
Subjects: Speech recognition; End-to-end; Attention mechanism
Online Access: http://link.springer.com/article/10.1186/s13636-019-0161-0