Speaker recognition using complementary information from vocal source and vocal tract.
Other Authors: | Zheng, Nengheng |
---|---|
Format: | Others |
Language: | English, Chinese |
Published: | 2005 |
Subjects: | Human-computer interaction; Speech processing systems |
Online Access: | http://library.cuhk.edu.hk/record=b6074159 http://repository.lib.cuhk.edu.hk/en/item/cuhk-343788 |
Abstract:
This thesis investigates the feasibility of using both vocal source and vocal tract information to improve speaker recognition performance. Conventional speaker recognition systems typically employ vocal tract related acoustic features, e.g. the Mel-frequency cepstral coefficients (MFCC), for discrimination. Motivated by the physiological significance of the vocal source and the vocal tract system in speech production, this thesis develops a speaker recognition system that effectively incorporates these two complementary information sources for improved performance and robustness.

The thesis presents a novel approach to representing speaker-specific vocal source characteristics. The linear predictive (LP) residual signal is adopted as a representative of the vocal source excitation, in which speaker-specific information resides in both the time and frequency domains. The Haar transform and the wavelet transform are applied for multi-resolution analysis of the LP residual signal. The resulting vocal source features, namely the Haar octave coefficients of residues (HOCOR) and the wavelet octave coefficients of residues (WOCOR), effectively capture the speaker-specific spectro-temporal characteristics of the LP residual signal. In particular, with a pitch-synchronous wavelet transform, the WOCOR feature set captures the pitch-related low-frequency properties and the high-frequency information associated with pitch epochs, as well as their temporal variations within a pitch period and over consecutive periods. The vocal source and vocal tract features are complementary because they are derived from two orthogonal components, the LP residual signal and the LP coefficients, and can therefore be fused for better speaker recognition performance. A preliminary scheme fusing MFCC and WOCOR improved identification and verification performance by 34.6% and 23.6% respectively, both in matched conditions.
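To make the extraction procedure concrete, the following is a minimal, illustrative Python/NumPy sketch of a HOCOR-style vocal source feature, not the thesis implementation: LP coefficients are estimated by the autocorrelation method, the frame is inverse-filtered to obtain the LP residual, and a Haar octave decomposition of the residual yields per-octave, per-subgroup coefficient norms. The frame length, LP order, and the numbers of octaves and temporal subgroups are assumed values, and the pitch-synchronous analysis used for WOCOR is omitted.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_coefficients(frame, order=12):
    """Estimate LP predictor coefficients by the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags 0..N-1
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])     # normal equations
    return np.concatenate(([1.0], -a))                             # inverse filter A(z)

def lp_residual(frame, order=12):
    """Inverse-filter the frame with A(z) to obtain the LP residual (source estimate)."""
    a_poly = lp_coefficients(frame * np.hamming(len(frame)), order)
    return lfilter(a_poly, [1.0], frame)

def haar_octave_features(residual, levels=4, groups=4):
    """HOCOR-like feature: norms of Haar detail coefficients per octave and subgroup."""
    x = residual[: (len(residual) // 2 ** levels) * 2 ** levels]
    feats = []
    for _ in range(levels):
        approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # Haar low-pass
        detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # Haar high-pass: one octave
        # split the octave's coefficients into temporal subgroups, keep their norms
        for chunk in np.array_split(detail, groups):
            feats.append(np.linalg.norm(chunk))
        x = approx
    return np.asarray(feats)

# Example: one 30 ms frame of 8 kHz speech (random placeholder signal)
frame = np.random.randn(240)
source_feature = haar_octave_features(lp_residual(frame, order=12))
print(source_feature.shape)   # (levels * groups,) = (16,)
```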
To maximize the benefit of fusing source and tract information, speaker-discrimination-dependent fusion techniques have been developed. For speaker identification, a confidence measure indicating the reliability of the vocal source feature is derived from the discrimination ratio between the source and tract features in each identification trial. Fusion with this confidence measure weights the scores given by the two features more appropriately and avoids errors that would otherwise be introduced by incorporating source information, thereby further improving identification performance. Compared with MFCC alone, a relative improvement of 46.8% has been achieved.

For speaker verification, a text-dependent weighting scheme is developed. Analysis shows that the source-tract discrimination ratio varies significantly across different sounds because of the diversity of vocal system configurations in speech production. The thesis analyzes the source-tract speaker discrimination ratio for the 10 Cantonese digits, upon which a digit-dependent source-tract weighting scheme is developed. Fusion with these digit-dependent weights relatively improves verification performance by 39.6% in matched conditions.

Experimental results also show that source-tract information fusion improves the robustness of speaker recognition systems in mismatched conditions; for example, relative improvements of 15.3% and 12.6% have been achieved for speaker identification and verification, respectively.
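The fusion schemes above operate at the score level: each trial yields a vocal tract (MFCC) score and a vocal source (WOCOR) score per candidate speaker, and the two are combined with a weight that reflects how reliable the source score is in that trial. The sketch below illustrates this idea with an assumed, simplified confidence measure (best-versus-runner-up score gap); it is not the thesis's exact discrimination ratio or weighting, and names such as `fuse_scores` and `max_source_weight` are illustrative.

```python
import numpy as np

def normalize(scores):
    """Zero-mean, unit-variance normalization of the candidate scores in one trial."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / (s.std() + 1e-12)

def discrimination_ratio(scores):
    """Crude per-trial discriminability: gap between the best and runner-up scores."""
    top2 = np.sort(scores)[-2:]
    return (top2[1] - top2[0]) / (np.abs(scores).mean() + 1e-12)

def fuse_scores(tract_scores, source_scores, max_source_weight=0.5):
    """Weighted score-level fusion of vocal tract (MFCC) and vocal source (WOCOR) scores.

    The source weight is scaled by the source scores' relative discriminability in
    this trial, so an unreliable source score contributes less to the fused score.
    """
    t = normalize(tract_scores)
    s = normalize(source_scores)
    dr_t = discrimination_ratio(t)
    dr_s = discrimination_ratio(s)
    confidence = dr_s / (dr_s + dr_t + 1e-12)      # in [0, 1]
    w_source = max_source_weight * confidence
    return (1.0 - w_source) * t + w_source * s

# Example identification trial over 5 enrolled speakers (higher score is better)
mfcc_scores  = np.array([-12.1, -10.4, -11.8, -13.0, -12.5])
wocor_scores = np.array([-8.3,  -8.1,  -9.0,  -8.8,  -8.6])
fused = fuse_scores(mfcc_scores, wocor_scores)
print("identified speaker:", int(np.argmax(fused)))
```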
Notes:
Zheng, Nengheng. Chinese University of Hong Kong Graduate School, Division of Electronic Engineering.
CUHK electronic theses & dissertations collection.
"November 2005." Adviser: Pak-Chung Ching.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2005.
Source: Dissertation Abstracts International, Volume: 67-11, Section: B, page: 6647.
Includes bibliographical references (p. 123-135).
Abstracts in English and Chinese. School code: 1307.
1 online resource (xiv, 135 p. : ill.). ISBN: 9780542965791.
Electronic reproduction. Hong Kong: Chinese University of Hong Kong, [2012]. System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI]: ProQuest Information and Learning, [200-]. System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Use of this resource is governed by the terms and conditions of the Creative Commons "Attribution-NonCommercial-NoDerivatives 4.0 International" License (http://creativecommons.org/licenses/by-nc-nd/4.0/).