Approaches for Modeling Noisy Speech

Bibliographic Details
Main Author: Xu, Sirui
Language: English
Published: The Ohio State University / OhioLINK, 2018
Subjects: Computer Science
Online Access:http://rave.ohiolink.edu/etdc/view?acc_num=osu1534680337238149
id ndltd-OhioLink-oai-etd.ohiolink.edu-osu1534680337238149
record_format oai_dc
collection NDLTD
language English
sources NDLTD
topic Computer Science
spellingShingle Computer Science
Xu, Sirui
Approaches for Modeling Noisy Speech
author Xu, Sirui
author_facet Xu, Sirui
author_sort Xu, Sirui
title Approaches for Modeling Noisy Speech
title_short Approaches for Modeling Noisy Speech
title_full Approaches for Modeling Noisy Speech
title_fullStr Approaches for Modeling Noisy Speech
title_full_unstemmed Approaches for Modeling Noisy Speech
title_sort approaches for modeling noisy speech
publisher The Ohio State University / OhioLINK
publishDate 2018
url http://rave.ohiolink.edu/etdc/view?acc_num=osu1534680337238149
work_keys_str_mv AT xusirui approachesformodelingnoisyspeech
_version_ 1719454541607337984
spelling ndltd-OhioLink-oai-etd.ohiolink.edu-osu1534680337238149 2021-08-03T07:08:18Z Approaches for Modeling Noisy Speech Xu, Sirui Computer Science

In this dissertation, we present our work on improving noisy speech recognition. Although recent ASR research has achieved considerable improvement on clean data, the mismatch between lab data and the noise environments of realistic speech situations remains a major challenge for further improving the recognition performance of speech applications. The variety of background noise types, multi-speaker talking scenarios, and distant speaking situations are just a few of the problems that need to be tackled.

One common approach to noisy speech recognition is to apply speech enhancement methods such as denoising and beamforming to improve the quality of the audio before passing it downstream in the speech system. Our work instead focuses on the decoding and acoustic modeling phases of the speech pipeline, and it can still be used in conjunction with speech enhancement methods to potentially further improve speech recognition systems.

The first project, in Chapter 2, proposes a WFST framework for single-pass multi-stream decoding. This work focuses on the decoding stage of the speech system; our proposed framework for integrating disparate automatic speech recognition systems achieves one-pass decoding by using vector semirings to extend the traditional WFST. The framework offers flexibility in combining systems at different levels of the decoding pipeline, and our experiments showed that it achieved performance comparable to MBR-based combination while significantly reducing computation time. The framework is also relatively memory-efficient due to the decoding structure shared between the streams.

In Chapter 3, we integrate transfer learning and system combination techniques. We apply Progressive Neural Networks (ProgNets) to acoustic modeling of noisy speech and then employ our WFST system to achieve system combination in the decoding phase. To take advantage of the ability of ProgNets to transfer knowledge between different domains or datasets, we sub-divided the data according to noise condition so that the trained models can share information about the noisy data. In addition, word-level combination in the multi-stream WFST framework further improves performance, as it can exploit a longer range of acoustic information for the combination. Combining the two techniques, our experiments achieved considerable improvement over the baseline, a 7-layer speaker-independent DNN system. We also compared the performance of our system with that of frame-level acoustic fusion techniques and observed a reduction in WER.

The work in Chapter 4 introduces spatial and channel attention into the modeling of noisy speech to suppress noise and emphasize informative acoustic features during acoustic modeling. The spatial attention mechanism is implemented as hourglass structures in which the input features are first downsampled and then upsampled to generate attention maps, which are expected to assign higher weights to more important features and lower weights to noise. At each block of the ResNet, CNN features are composed with the attention maps so that the network learns to attend to the most salient acoustic features and suppress noise. The channel attention, in turn, learns to attend to different channels according to their importance in the feature maps. ResNet blocks carrying the spatial and channel attention modules can easily be stacked to build deeper networks. We experimented with the attended ResNet on noisy datasets and achieved promising WER reductions.

We conclude the dissertation in Chapter 5 with a summary of contributions and a discussion of directions for future research.
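The Chapter 2 idea of extending a WFST with a vector semiring can be illustrated with a minimal sketch: each arc weight becomes a tuple of per-stream tropical (min, +) costs combined component-wise, so one pass over a shared decoding structure tracks all streams at once. The class and its operations below are hypothetical illustrations, not the dissertation's actual formulation.

```python
# Minimal sketch of a vector tropical semiring: each weight holds one
# tropical (min, +) cost per recognition stream, and the semiring
# operations apply component-wise. Names are illustrative assumptions.
INF = float("inf")

class VectorTropicalWeight:
    """Tuple of per-stream tropical weights (lower cost = better)."""

    def __init__(self, costs):
        self.costs = tuple(costs)

    def plus(self, other):
        # Semiring "plus": component-wise min, i.e. keep each stream's
        # best alternative path independently.
        return VectorTropicalWeight(
            min(a, b) for a, b in zip(self.costs, other.costs))

    def times(self, other):
        # Semiring "times": component-wise addition, i.e. extend each
        # stream's path cost along a concatenated arc sequence.
        return VectorTropicalWeight(
            a + b for a, b in zip(self.costs, other.costs))

    @staticmethod
    def zero(n):
        # Additive identity: no path yet in any stream.
        return VectorTropicalWeight([INF] * n)

    @staticmethod
    def one(n):
        # Multiplicative identity: the empty (zero-cost) path.
        return VectorTropicalWeight([0.0] * n)
```

Because `plus` and `times` act independently per component, standard WFST shortest-path machinery can run once over the shared lattice while each stream's score evolves in its own slot of the vector.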
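The hourglass spatial attention described in Chapter 4 can likewise be sketched in miniature: coarsen the features by pooling, expand back to the input resolution, squash to (0, 1), and reweight the input element-wise. The 1-D shapes, pooling factor, and function name below are toy assumptions for illustration; the dissertation's attention maps operate on 2-D CNN feature maps inside ResNet blocks.

```python
# Toy hourglass-style spatial attention on a 1-D feature sequence:
# downsample (average pooling), upsample (nearest-neighbor), sigmoid,
# then reweight the input. Illustrative only; assumes len(features) is
# divisible by `pool`.
import math

def hourglass_attention(features, pool=2):
    # Downsample: average each window of `pool` adjacent features.
    down = [sum(features[i:i + pool]) / pool
            for i in range(0, len(features), pool)]
    # Upsample back to input length by nearest-neighbor repetition.
    up = [down[i // pool] for i in range(len(features))]
    # Sigmoid turns the coarse map into per-position weights in (0, 1).
    attn = [1.0 / (1.0 + math.exp(-u)) for u in up]
    # Compose: element-wise reweighting of the original features,
    # boosting salient positions and attenuating noisy ones.
    return [f * a for f, a in zip(features, attn)]
```

In the full model, each ResNet block would apply such a gate (plus a channel-attention counterpart over feature-map channels) before passing activations to the next block, so the blocks stack into deeper attended networks.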
2018 English text The Ohio State University / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=osu1534680337238149 http://rave.ohiolink.edu/etdc/view?acc_num=osu1534680337238149 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.