A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
Abstract: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and calculates attention scores over multi-level outputs. Experiments on TIMIT show that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% (absolute) worse than the best referenced method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
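The hybrid objective the abstract describes is typically an interpolation of the CTC loss and the attention decoder's loss. A minimal sketch of that combination, assuming scalar per-batch losses; the weight name `lmbda` and the toy loss values are illustrative assumptions, not taken from the paper:

```python
def joint_ctc_attention_loss(ctc_loss: float, attention_loss: float,
                             lmbda: float = 0.3) -> float:
    """Interpolate the CTC and attention losses with weight lmbda in [0, 1]."""
    if not 0.0 <= lmbda <= 1.0:
        raise ValueError("lmbda must lie in [0, 1]")
    return lmbda * ctc_loss + (1.0 - lmbda) * attention_loss

# Toy per-batch loss values, purely for illustration.
total = joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0, lmbda=0.3)
print(total)  # 0.3 * 2.0 + 0.7 * 1.0, i.e. approximately 1.3
```

During training the two losses are computed from shared encoder outputs and back-propagated jointly; the CTC branch acts as a regularizer that encourages monotonic alignments.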
Main Authors: Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu
Format: Article
Language: English
Published: SpringerOpen, 2019-10-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
Subjects: Speech recognition; End-to-end; Attention mechanism
Online Access: http://link.springer.com/article/10.1186/s13636-019-0161-0
id: doaj-17464fc86c904a17a9114d9a891854f5
record_format: Article
doi: 10.1186/s13636-019-0161-0
affiliations: National Digital Switching System Engineering and Technological R&D Center (Chu-Xiong Qin, Wen-Lin Zhang, Dan Qu)
collection: DOAJ
language: English
format: Article
sources: DOAJ
author: Chu-Xiong Qin; Wen-Lin Zhang; Dan Qu
author_sort: Chu-Xiong Qin
title: A new joint CTC-attention-based speech recognition model with multi-level multi-head attention
publisher: SpringerOpen
series: EURASIP Journal on Audio, Speech, and Music Processing
issn: 1687-4722
publishDate: 2019-10-01
description: A method called joint connectionist temporal classification (CTC)-attention-based speech recognition has recently received increasing attention and has achieved impressive performance. A hybrid end-to-end architecture that adds an extra CTC loss to the attention-based model can impose additional restrictions on alignments. To better explore end-to-end models, we propose improvements to the feature extraction and the attention mechanism. First, we introduce a joint model trained with nonnegative matrix factorization (NMF)-based high-level features. Then, we put forward a hybrid attention mechanism that incorporates multi-head attention and calculates attention scores over multi-level outputs. Experiments on TIMIT show that the new method achieves state-of-the-art performance with our best model. Experiments on WSJ show that our method yields a word error rate (WER) only 0.2% (absolute) worse than the best referenced method, which is trained on a much larger dataset, and that it outperforms all existing end-to-end methods. Further experiments on LibriSpeech show that our method is also comparable in WER to the state-of-the-art end-to-end system.
topic: Speech recognition; End-to-end; Attention mechanism
url: http://link.springer.com/article/10.1186/s13636-019-0161-0
_version_: 1724658250480091136