Acoustic Model Training for Speech Recognition
Master's thesis === National Taipei University === Graduate Institute of Communication Engineering === 103 === Speech is the primary means of communication among human beings. This form of communication is the fundamental thread that has underlain human progress since time immemorial. Therefore, speech communication is the thread that is interwoven in the...
Main Authors: | Nhlanhla Simanga Dlamini, 嵐奇 |
---|---|
Other Authors: | CHEN-YU Chiang |
Format: | Others |
Language: | en_US |
Published: | 2015 |
Online Access: | http://ndltd.ncl.edu.tw/handle/39876544496883454574 |
id |
ndltd-TW-103NTPU0650013 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-103NTPU0650013 2016-08-19T04:10:36Z http://ndltd.ncl.edu.tw/handle/39876544496883454574 Acoustic Model Training for Speech Recognition 語音辨識系統之聲學模型訓練研究 Nhlanhla Simanga Dlamini 嵐奇 碩士 國立臺北大學 通訊工程研究所 103 CHEN-YU Chiang 江振宇 2015 學位論文 ; thesis 79 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Taipei University === Graduate Institute of Communication Engineering === 103 === Speech is the primary means of communication among human beings. It has underlain human progress since time immemorial and is interwoven in the fabric of every human culture. A compelling reason to study and work with speech is that it is the most common form of communication within the human community, which makes it ubiquitous and pervasive.
This thesis investigates and develops acoustic model training for a speech recognition system, which is the first step in building such a system. Acoustic model training underlies a speech recognition engine with many applications, including:
1. Small-vocabulary keyword recognition over dial-up telephone lines.
2. Medium-vocabulary voice-interactive command and control systems, e.g. IVR in telecommunications and voice-activated systems in automobiles, in banks, and for the disabled community.
3. Limited-domain speech translation.
Training refers to parameter estimation that seeks to maximize the likelihood of the observations given a string of words. A hierarchical approach is adopted in which different sources of information are represented. This is motivated by the fact that speech recognition depends on the vocabulary, the language model, and the HMMs. In this thesis, speech is modeled by HMM states, the pronunciation dictionary is the TIMIT dictionary, and the training corpus is the TIMIT training data. A network representing these three sources of information is built using FST technology.
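The quantity that such training maximizes can be illustrated with a minimal sketch. The Python below is not taken from the thesis: it assumes a toy discrete-observation HMM with made-up parameters (`init`, `trans`, `emit` are hypothetical names and values) and uses the scaled forward algorithm to compute the log-likelihood of an observation sequence. Maximum-likelihood training would adjust the parameters to increase this value over the training data.

```python
import math


def forward_log_likelihood(init, trans, emit, observations):
    """Return log P(observations | HMM) using the scaled forward algorithm.

    init[i]     -- probability of starting in state i
    trans[i][j] -- probability of moving from state i to state j
    emit[j][k]  -- probability that state j emits symbol k
    """
    n_states = len(init)
    # Unnormalized forward probabilities for the first observation.
    alpha = [init[j] * emit[j][observations[0]] for j in range(n_states)]
    log_likelihood = 0.0
    for t in range(len(observations)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                * emit[j][observations[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        # Rescale to avoid underflow; the log of each scale factor accumulates.
        alpha = [a / scale for a in alpha]
        log_likelihood += math.log(scale)
    return log_likelihood


# Hypothetical two-state left-to-right model over two symbols (0 and 1).
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]

toy_log_likelihood = forward_log_likelihood(init, trans, emit, [0, 1])
```

For this toy model, the two possible state paths contribute 0.045 and 0.36, so `math.exp(toy_log_likelihood)` is approximately 0.405. A real acoustic model would use continuous feature vectors and many more states.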
The motivation for this study is twofold: the academic requirements and the professional need to fulfill a long-standing desire to do something great. From the academic perspective, mathematical and computational algorithms, methods, and techniques are sought to conceptualize training as a first step in building a speech recognition system. The success of this academic endeavor will culminate in the application of similar ideas in industry. As a telecommunication engineer devoted to this study, I can think of immediate applications of this research in industry.
Speech communication can be extended to human-machine communication. Making a machine recognize and respond to speech is a pattern recognition task, which this thesis investigates. Success in this extension implies a growing domain of speech communication, and that success is attributable to the wider community of speech researchers and professionals. As this domain grows, businesses will also wish to embrace human-machine speech communication. One of the driving forces is the economic and social convenience that comes with this form of communication. The benefits of embracing such a technology are substantial: interacting with a machine will be effortless, increasing the number of people who use the technology, e.g. the disabled and elderly communities.
|
author2 |
CHEN-YU Chiang |
author_facet |
CHEN-YU Chiang Nhlanhla Simanga Dlamini 嵐奇 |
author |
Nhlanhla Simanga Dlamini 嵐奇 |
spellingShingle |
Nhlanhla Simanga Dlamini 嵐奇 Acoustic Model Training for Speech Recognition |
author_sort |
Nhlanhla Simanga Dlamini |
title |
Acoustic Model Training for Speech Recognition |
title_short |
Acoustic Model Training for Speech Recognition |
title_full |
Acoustic Model Training for Speech Recognition |
title_fullStr |
Acoustic Model Training for Speech Recognition |
title_full_unstemmed |
Acoustic Model Training for Speech Recognition |
title_sort |
acoustic model training for speech recognition |
publishDate |
2015 |
url |
http://ndltd.ncl.edu.tw/handle/39876544496883454574 |
work_keys_str_mv |
AT nhlanhlasimangadlamini acousticmodeltrainingforspeechrecognition AT lánqí acousticmodeltrainingforspeechrecognition AT nhlanhlasimangadlamini yǔyīnbiànshíxìtǒngzhīshēngxuémóxíngxùnliànyánjiū AT lánqí yǔyīnbiànshíxìtǒngzhīshēngxuémóxíngxùnliànyánjiū |
_version_ |
1718378670429569024 |