Towards multilingual end‐to‐end speech recognition for air traffic control

Abstract In this work, an end‐to‐end framework is proposed to achieve multilingual automatic speech recognition (ASR) in air traffic control (ATC) systems. Considering the standard ATC procedure, a recurrent neural network (RNN) based framework is selected to mine the temporal dependencies among spe...

Bibliographic Details
Main Authors:	Yi Lin, Bo Yang, Dongyue Guo, Peng Fan
Format:	Article
Language:	English
Published:	Wiley 2021-09-01
Series:	IET Intelligent Transport Systems
Online Access:	https://doi.org/10.1049/itr2.12094

id	doaj-40032cd49b7c41e2ae5ec2934487b38d
record_format	Article
spelling	doaj-40032cd49b7c41e2ae5ec2934487b38d | 2021-08-04T08:52:37Z | eng | Wiley | IET Intelligent Transport Systems | 1751-956X, 1751-9578 | 2021-09-01 | 15(9):1203-1214 | 10.1049/itr2.12094 | Towards multilingual end‐to‐end speech recognition for air traffic control | Yi Lin, Bo Yang, Dongyue Guo, Peng Fan (all: College of Computer Science, Sichuan University, Chengdu, Sichuan, China) | https://doi.org/10.1049/itr2.12094
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Yi Lin, Bo Yang, Dongyue Guo, Peng Fan
title	Towards multilingual end‐to‐end speech recognition for air traffic control
publisher	Wiley
series	IET Intelligent Transport Systems
issn	1751-956X 1751-9578
publishDate	2021-09-01
description	Abstract: This work proposes an end‐to‐end framework for multilingual automatic speech recognition (ASR) in air traffic control (ATC) systems. Considering the standard ATC procedure, a recurrent neural network (RNN) based framework is selected to mine the temporal dependencies among speech frames. To handle the distributed feature space caused by radio transmission, a hybrid feature embedding block is designed to extract high‐level representations, in which multiple convolutional neural networks accommodate different frequency and temporal resolutions. A residual mechanism is applied to the RNN layers to improve trainability and convergence. To integrate multilingual ASR into a single model and relieve the class imbalance, a special vocabulary, i.e., a pronunciation‐oriented vocabulary, is designed to unify the pronunciation of the vocabulary in Chinese and English. The proposed model is optimized with the connectionist temporal classification loss and validated on a real‐world speech corpus (ATCSpeech). It achieves character error rates of 4.4% and 5.9% for Chinese and English speech, respectively, outperforming other popular approaches. Most importantly, the proposed approach accomplishes the multilingual ASR task in an end‐to‐end manner with considerably high performance.
url	https://doi.org/10.1049/itr2.12094

Similar Items