Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection
Ph.D. === National Taiwan University === Graduate Institute of Electrical Engineering === 105 === In the era of big data, huge quantities of raw speech data are easy to obtain, but annotated speech data remain hard to acquire. This makes unsupervised learning scenarios, in which annotated data are not required, increasingly important; a typical application...
Main Authors: | Cheng-Tao Chung 鍾承道 |
---|---|
Other Authors: | Lin-Shan Lee 李琳山 |
Format: | Others |
Language: | en_US |
Published: | 2017 |
Online Access: | http://ndltd.ncl.edu.tw/handle/p9p96r |
id |
ndltd-TW-105NTU05442082 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-105NTU05442082 2019-05-15T23:39:40Z http://ndltd.ncl.edu.tw/handle/p9p96r Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection 無監督式結構化語音模型和語音特徵及其在語音檢索的運用 (Unsupervised Structured Speech Models and Speech Features and Their Application to Speech Retrieval) Cheng-Tao Chung 鍾承道 Ph.D. National Taiwan University Graduate Institute of Electrical Engineering 105 Lin-Shan Lee 李琳山 2017 thesis 97 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Ph.D. === National Taiwan University === Graduate Institute of Electrical Engineering === 105 === In the era of big data, huge quantities of raw speech data are easy to obtain, but annotated speech data remain hard to acquire. This makes unsupervised learning scenarios, in which annotated data are not required, increasingly important; a typical application is Query-by-Example Spoken Term Detection (QbE-STD). Because the dominant paradigm in automatic speech recognition (ASR) is supervised learning, such scenarios remain relatively unexplored. In this thesis, we present the Hierarchical Paradigm and the Multi-granularity Paradigm for unsupervised discovery of structured acoustic tokens directly from speech corpora. The Hierarchical Paradigm attempts to jointly learn two levels of representation, correlated with phonemes and words respectively. The Multi-granularity Paradigm makes no assumption about which set of tokens to select, and instead seeks to capture all available information with multiple token sets of different model granularities. Furthermore, unsupervised speech features can be extracted from the multi-granular acoustic tokens within a framework we call the Multi-granular Acoustic Tokenizing Deep Neural Network (MAT-DNN). We unified the two paradigms in a single theoretical framework and performed query-by-example spoken term detection experiments on both the token sets and the frame-level features. The theories and principles for acoustic tokens and frame-level features proposed in this thesis are supported by results competitive with strong baselines on standard corpora, measured with well-defined metrics.
|
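The abstract reports query-by-example spoken term detection experiments on frame-level features. A common, generic way to score a spoken query against an utterance represented by such features is subsequence dynamic time warping (DTW), which aligns the whole query to the best-matching span anywhere in the utterance. The sketch below illustrates that technique under stated assumptions and is not necessarily the scoring procedure used in the thesis: frames are plain lists of floats (for example, frame-level features such as those produced by MAT-DNN), the local distance is cosine distance, and the names `cosine_distance` and `subsequence_dtw` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def cosine_distance(a: Frame, b: Frame) -> float:
    """Return 1 - cosine similarity; 0.0 when a and b point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0  # assumption: an all-zero frame is treated as dissimilar
    return 1.0 - dot / norm


def subsequence_dtw(query: Sequence[Frame], doc: Sequence[Frame]) -> Tuple[float, int]:
    """Align the whole query to the best-matching contiguous span of doc.

    Returns (accumulated_cost, end_frame_in_doc); lower cost = better match.
    """
    m, n = len(query), len(doc)
    if m == 0 or n == 0:
        raise ValueError("query and doc must both be non-empty")
    prev: List[float] = []
    for i in range(m):
        cur: List[float] = [0.0] * n
        for j in range(n):
            c = cosine_distance(query[i], doc[j])
            if i == 0:
                cur[j] = c  # free start: the match may begin at any doc frame
            elif j == 0:
                cur[j] = c + prev[0]  # only vertical steps reach column 0
            else:
                # standard DTW step: diagonal, vertical, or horizontal move
                cur[j] = c + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    # free end: the match may stop at any doc frame
    end = min(range(n), key=lambda j: prev[j])
    return prev[end], end


if __name__ == "__main__":
    doc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    query = [[1.0, 1.0], [1.0, 0.0]]
    cost, end = subsequence_dtw(query, doc)
    print(f"cost={cost:.3f} end_frame={end}")  # prints cost=0.000 end_frame=3
```

Utterances in a collection can then be ranked by ascending cost. The scores here are unnormalized; practical QbE-STD systems often normalize by alignment length or combine scores from several feature or token sets, which this sketch does not attempt.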
author2 |
Lin-Shan Lee |
author |
Cheng-Tao Chung 鍾承道 |
title |
Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection |
publishDate |
2017 |
url |
http://ndltd.ncl.edu.tw/handle/p9p96r |