Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection

Doctoral === National Taiwan University === Graduate Institute of Electrical Engineering === 105 === In the era of big data, huge quantities of raw speech data are easy to obtain, but annotated speech data remain hard to acquire. This leads to the increased importance of unsupervised learning scenarios where annotated data are not required, a typical application...

Bibliographic Details
Main Authors: Cheng-Tao Chung, 鍾承道
Other Authors: Lin-Shan Lee
Format: Others
Language: en_US
Published: 2017
Online Access: http://ndltd.ncl.edu.tw/handle/p9p96r
id ndltd-TW-105NTU05442082
record_format oai_dc
spelling ndltd-TW-105NTU05442082 2019-05-15T23:39:40Z http://ndltd.ncl.edu.tw/handle/p9p96r Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection 無監督式結構化語音模型和語音特徵及其在語音檢索的運用 Cheng-Tao Chung 鍾承道 Doctoral, National Taiwan University, Graduate Institute of Electrical Engineering, 105 Lin-Shan Lee 李琳山 2017 學位論文 ; thesis 97 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral === National Taiwan University === Graduate Institute of Electrical Engineering === 105 === In the era of big data, huge quantities of raw speech data are easy to obtain, but annotated speech data remain hard to acquire. This increases the importance of unsupervised learning scenarios in which annotated data are not required; a typical application is Query-by-Example Spoken Term Detection (QbE-STD). Because the dominant paradigm in automatic speech recognition (ASR) is supervised learning, this scenario remains relatively underexplored. In this thesis, we present the Hierarchical Paradigm and the Multi-granularity Paradigm for unsupervised discovery of structured acoustic tokens directly from speech corpora. The Hierarchical Paradigm attempts to jointly learn two levels of representations that are correlated with phonemes and words. The Multi-granularity Paradigm makes no assumptions about which set of tokens to select, and seeks to capture all available information using multiple sets of tokens with different model granularities. Furthermore, unsupervised speech features can be extracted from the multi-granular acoustic tokens with a framework that we call the Multi-granular Acoustic Tokenizing Deep Neural Network (MAT-DNN). We unify the two paradigms in a single theoretical framework and perform query-by-example spoken term detection experiments on the token sets and frame-level features. The theories and principles on acoustic tokens and frame-level features proposed in this thesis are supported by competitive results against strong baselines on standard corpora using well-defined metrics.
author2 Lin-Shan Lee
author_facet Lin-Shan Lee
Cheng-Tao Chung
鍾承道
author Cheng-Tao Chung
鍾承道
title Unsupervised Discovery of Structured Acoustic Tokens and Speech Features with Applications to Spoken Term Detection
publishDate 2017
url http://ndltd.ncl.edu.tw/handle/p9p96r