Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering


Bibliographic Details
Main Authors: Gene-Ping Yang, 楊靖平
Other Authors: 李琳山
Format: Others
Language: zh-TW
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/7gknjc
id ndltd-TW-107NTU05641029
record_format oai_dc
spelling ndltd-TW-107NTU056410292019-11-16T05:28:00Z http://ndltd.ncl.edu.tw/handle/7gknjc Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering 基於時頻跨域共同嵌入及聚類之語音分離 Gene-Ping Yang 楊靖平 Master's National Taiwan University Graduate Institute of Networking and Multimedia 107 The main topic of this thesis is speaker-independent speech separation: separating two or more speakers in a mixed speech signal without prior information about the speakers. This is useful in many speech processing systems, including speech recognition and speaker recognition. When two or more speakers are present in the audio, our goal is to separate these voices with similar characteristics. At present, deep learning approaches fall into two mainstreams: frequency-domain methods and time-domain methods. The biggest difference between the two is the model input: one takes the original time-domain waveform, while the other takes the frequency-domain spectrum obtained by the short-time Fourier transform. The two approaches also use different model architectures to handle these different inputs, and each has its own drawbacks. This thesis proposes a separation technique based on time-and-frequency cross-domain joint embedding and clustering, which allows the input signals from the two different domains (time and frequency) to reference each other. Our models are mainly based on convolution-like neural networks, and the proposed method is among the best-performing algorithms for speaker-independent speech separation. We analyze the influence of different types of neural modules on this problem and compare their advantages and disadvantages through experimental data. 李琳山 2019 thesis 76 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Taiwan University === Graduate Institute of Networking and Multimedia === 107 === The main topic of this thesis is speaker-independent speech separation: separating two or more speakers in a mixed speech signal without prior information about the speakers. This is useful in many speech processing systems, including speech recognition and speaker recognition. When two or more speakers are present in the audio, our goal is to separate these voices with similar characteristics. At present, deep learning approaches fall into two mainstreams: frequency-domain methods and time-domain methods. The biggest difference between the two is the model input: one takes the original time-domain waveform, while the other takes the frequency-domain spectrum obtained by the short-time Fourier transform. The two approaches also use different model architectures to handle these different inputs, and each has its own drawbacks. This thesis proposes a separation technique based on time-and-frequency cross-domain joint embedding and clustering, which allows the input signals from the two different domains (time and frequency) to reference each other. Our models are mainly based on convolution-like neural networks, and the proposed method is among the best-performing algorithms for speaker-independent speech separation. We analyze the influence of different types of neural modules on this problem and compare their advantages and disadvantages through experimental data.
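The description above contrasts the two mainstream input representations: the raw time-domain waveform versus the frequency-domain spectrum obtained by the short-time Fourier transform. A minimal sketch of that contrast (not the thesis's implementation; the sample rate, frame size, and the toy two-tone "mixture" are assumptions for illustration):

```python
# Sketch: the two input representations described in the abstract.
import numpy as np
from scipy.signal import stft

sr = 8000                                  # assumed sample rate
t = np.arange(sr) / sr
# Toy "mixture" standing in for two overlapping speakers: two sine tones.
mixture = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 660 * t)

# Time-domain path: the model consumes the waveform directly.
time_input = mixture                       # shape: (samples,)

# Frequency-domain path: the model consumes the STFT magnitude spectrum.
freqs, frames, Z = stft(mixture, fs=sr, nperseg=256)
freq_input = np.abs(Z)                     # shape: (freq_bins, time_frames)

print(time_input.shape)   # (8000,)
print(freq_input.shape)   # (129, time_frames)
```

A cross-domain model, as the abstract describes it, would let embeddings computed from both representations reference each other rather than committing to only one input.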
author2 李琳山
author_facet 李琳山
Gene-Ping Yang
楊靖平
author Gene-Ping Yang
楊靖平
spellingShingle Gene-Ping Yang
楊靖平
Speech Separation with Time-and-Frequency Cross-DomainJoint Embedding and Clustering
author_sort Gene-Ping Yang
title Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering
title_short Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering
title_full Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering
title_fullStr Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering
title_full_unstemmed Speech Separation with Time-and-Frequency Cross-Domain Joint Embedding and Clustering
title_sort speech separation with time-and-frequency cross-domain joint embedding and clustering
publishDate 2019
url http://ndltd.ncl.edu.tw/handle/7gknjc
work_keys_str_mv AT genepingyang speechseparationwithtimeandfrequencycrossdomainjointembeddingandclustering
AT yángjìngpíng speechseparationwithtimeandfrequencycrossdomainjointembeddingandclustering
AT genepingyang jīyúshípínkuàyùgòngtóngqiànrùjíjùlèizhīyǔyīnfēnlí
AT yángjìngpíng jīyúshípínkuàyùgòngtóngqiànrùjíjùlèizhīyǔyīnfēnlí
_version_ 1719292829738467328