Improving Low-Resource Speech Recognition Based on Improved NN-HMM Structures

The performance of automatic speech recognition (ASR) systems is unsatisfactory in low-resource environments. In this paper, we investigate the effectiveness of three approaches to improving acoustic models under such conditions: Mono-and-triphone Learning, Soft One-hot Labels, and Feature Combinations. We applied the three methods to the network architecture and compared their results with baselines. Our proposal achieves a remarkable improvement on a Mandarin speech recognition task using the phoneme-level hybrid hidden Markov model / neural network (NN-HMM) approach. To verify the generalization ability of the proposed methods, we conducted comparative experiments on DNN, RNN, LSTM, and other network structures; the results show that the methods are applicable to almost all widely used network structures. Compared to the baselines, our proposals achieved an average relative Character Error Rate (CER) reduction of 8.0%. The training data amounts to roughly 10 hours, and we did not use data augmentation or transfer learning, i.e., no additional data.

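The record itself gives no implementation details, but the three techniques named in the abstract map naturally onto a standard hybrid acoustic-model recipe. Below is a minimal PyTorch sketch of how they could fit together; the shared-encoder layout, layer sizes, smoothing value, auxiliary-loss weight, and the MFCC + filterbank feature choice are illustrative assumptions, not details taken from the paper.

    # Hypothetical sketch of the three ideas described in the abstract, assuming a
    # PyTorch NN-HMM hybrid setup; sizes and weights below are illustrative guesses.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskAcousticModel(nn.Module):
        """Shared encoder with monophone and triphone (senone) output heads."""
        def __init__(self, feat_dim, num_monophones, num_senones, hidden=512):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.mono_head = nn.Linear(hidden, num_monophones)  # auxiliary task
            self.tri_head = nn.Linear(hidden, num_senones)      # main HMM-state task

        def forward(self, feats):
            h = self.encoder(feats)
            return self.mono_head(h), self.tri_head(h)

    def soft_one_hot(labels, num_classes, smoothing=0.1):
        """Soft one-hot targets: spread a little mass over non-target classes."""
        off = smoothing / (num_classes - 1)
        targets = torch.full((labels.size(0), num_classes), off)
        targets.scatter_(1, labels.unsqueeze(1), 1.0 - smoothing)
        return targets

    # Feature combination: concatenate two front-end features frame by frame
    # (e.g., MFCC + filterbank); dimensions here are only placeholders.
    mfcc = torch.randn(32, 13)       # batch of 32 frames, 13-dim MFCC
    fbank = torch.randn(32, 40)      # 40-dim filterbank
    feats = torch.cat([mfcc, fbank], dim=-1)

    model = MultiTaskAcousticModel(feat_dim=53, num_monophones=100, num_senones=3000)
    mono_logits, tri_logits = model(feats)

    mono_labels = torch.randint(0, 100, (32,))
    tri_labels = torch.randint(0, 3000, (32,))

    # Cross-entropy against soft one-hot targets for both tasks; the 0.3 weight
    # on the auxiliary monophone loss is an assumed value.
    loss_tri = -(soft_one_hot(tri_labels, 3000) * F.log_softmax(tri_logits, -1)).sum(-1).mean()
    loss_mono = -(soft_one_hot(mono_labels, 100) * F.log_softmax(mono_logits, -1)).sum(-1).mean()
    loss = loss_tri + 0.3 * loss_mono

Splitting the output into a triphone (senone) head plus an auxiliary monophone head is the usual way mono-and-triphone multitask learning is realized in hybrid systems, and the soft one-hot targets act like label smoothing, discouraging over-confident state posteriors when training data is scarce.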

Bibliographic Details
Main Authors: Xiusong Sun, Qun Yang, Shaohan Liu, Xin Yuan
Format: Article
Language: English
Published: IEEE, 2020-01-01
Series: IEEE Access
Subjects: Low-resource; speech recognition; multitask learning; acoustic modeling; feature combinations
Online Access: https://ieeexplore.ieee.org/document/9069188/
DOI: 10.1109/ACCESS.2020.2988365
ISSN: 2169-3536
Published in: IEEE Access, vol. 8, pp. 73005-73014, 2020 (article 9069188)
DOAJ record ID: doaj-087680b08d5f43cdb66d29671a251eef (last updated 2021-03-30)
Author details: Xiusong Sun (ORCID 0000-0003-0232-7069), Qun Yang (ORCID 0000-0001-6824-8473), Shaohan Liu (ORCID 0000-0001-6967-0262), Xin Yuan (ORCID 0000-0003-2261-1809); all with the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China