Biologically inspired speaker verification

Bibliographic Details
Main Author: Tashan, T.
Published: Nottingham Trent University 2012
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.629244
Description
Summary: Speaker verification is an active research problem that has been addressed using a variety of classification techniques. In general, however, methods inspired by the human auditory system tend to show better verification performance than other methods. This thesis presents three biologically inspired speaker verification algorithms.

The first is a vowel-dependent speaker verification method that uses a modified Self Organising Map (SOM) algorithm. For each speaker, a seeded SOM is trained, using positive samples only, to produce representative Discrete Fourier Transform (DFT) models of three vowels from the spoken input. This SOM training is performed both during a registration phase and during each subsequent verification attempt. Speaker verification is achieved by computing the Euclidean distance between the registration and verification SOM trained weight sets. An analysis of comparative system performance using DFT input vectors, as well as Linear Predictive Coding (LPC) spectrum and Mel Frequency Cepstral Coefficient (MFCC) alternative input features, indicates that the DFT spectrum outperforms both MFCC and LPC features. The algorithm was evaluated using 50 speakers from the Centre for Spoken Language Understanding (CSLU2002) speaker verification database.

The second method consists of two neural network stages. The first stage is the modified SOM, which now operates as a vowel clustering stage that filters the input speech data and separates it into three sets of vowel information. The second stage contains three Multi-Layer Perceptron (MLP) networks, each acting as a distinct vowel verifier. Adding this second stage allows the use of negative sample training. The input of each MLP network is the corresponding filtered vowel data from the first stage. The DFT spectrum is again used as the input feature vector owing to its optimal performance in the first algorithm. The overall system was evaluated using the same dataset as the first algorithm and showed improved verification performance compared with the algorithm without the MLP stage.

The third biologically plausible method is a speaker verification algorithm that uses a positive-sample-only trained Self Organising Map composed of spiking neurons. The architecture of the system is inspired by the biomechanical mechanism of the human auditory system, which converts speech into electrical spikes inside the cochlea. A spike-based rank order coding input feature vector is proposed, designed to be representative of the real biological spike trains found within the human auditory nerve. The Spiking Self Organising Map (SSOM) updates its winner neuron only when its activity exceeds a specified threshold. The algorithm is evaluated using the same 50-speaker dataset from the CSLU2002 speaker verification database, and the results indicate that the SSOM verification performance is comparable to that of the non-spiking SOM.

Finally, a new technique to detect speech activity within speech signals is also proposed. This technique uses the linear correlation coefficient (Pearson coefficient), calculated in the frequency domain between neighbouring frames of DFT spectrum feature vectors. By summing the correlation coefficients within a sliding window over time, a correlation envelope is produced, which can be used to identify speech activity.
The proposed technique is compared with a conventional energy frame analysis method and shows greater robustness against changes in speech volume level. A comparison of the two techniques, in terms of speaker verification application performance, is presented in Appendix A using 240 speech waveforms from the CSLU2002 speaker verification database.
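
The sketch below is a rough illustration of the decision rule in the first algorithm described above: one small SOM is trained per utterance on DFT magnitude frames (positive samples only), and the registration and verification weight sets are compared by Euclidean distance. It is a minimal NumPy sketch, not the thesis's implementation: the SOM update is a plain unseeded one rather than the modified seeded SOM, vowel selection is omitted, and the frame length, node count and learning rate are illustrative values.

```python
# Minimal sketch (NumPy only) of the distance-based decision: train one small
# SOM per utterance on DFT magnitude frames and compare the two weight sets.
# The update below is a plain unseeded SOM, so the node-to-vowel correspondence
# that the thesis's seeding provides is not reproduced here.
import numpy as np


def dft_frames(signal, frame_len=256, hop=128):
    """Split a waveform into overlapping frames and return DFT magnitudes."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames) * np.hanning(frame_len), axis=1))


def train_som(vectors, n_nodes=3, epochs=50, lr=0.2, seed=0):
    """Standard SOM-style update: each node is pulled toward the frames it wins."""
    rng = np.random.default_rng(seed)
    weights = vectors[rng.choice(len(vectors), n_nodes, replace=False)].copy()
    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)          # decaying learning rate
        for v in vectors:
            winner = np.argmin(np.linalg.norm(weights - v, axis=1))
            weights[winner] += alpha * (v - weights[winner])
    return weights


def verification_distance(reg_weights, ver_weights):
    """Euclidean distance between the trained weight sets; smaller values
    suggest the same speaker (a decision threshold would be tuned on data)."""
    return float(np.linalg.norm(reg_weights - ver_weights))


# Usage with random data standing in for enrolment and verification utterances.
fs = 8000
enrol = np.random.randn(fs)
attempt = np.random.randn(fs)
reg = train_som(dft_frames(enrol))
ver = train_som(dft_frames(attempt))
print("distance:", verification_distance(reg, ver))
```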
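
The rank order coding input and the threshold-gated winner update of the third (SSOM) algorithm can be sketched as follows. The encoding and activation rule use a generic Thorpe-style rank order scheme; the modulation factor, threshold and weight update are assumptions for illustration and may differ from the SSOM formulation in the thesis.

```python
# Illustrative sketch of a rank-order-coded DFT frame and a winner update that
# fires only when the winner's activation exceeds a threshold.
import numpy as np

MOD = 0.9  # modulation factor: earlier spikes (lower ranks) contribute more


def rank_order_code(spectrum):
    """Firing rank of each DFT bin: the largest-magnitude bin is assumed to
    spike first (rank 0), the next largest second, and so on."""
    order = np.argsort(-spectrum)            # indices from largest to smallest
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(spectrum))
    return ranks


def node_activations(weights, ranks):
    """Rank order decoding: activation of node j is sum_i MOD**rank_i * w[j, i]."""
    return weights @ (MOD ** ranks)


def ssom_step(weights, ranks, threshold=1.0, lr=0.1):
    """Update the winner node only when its activation exceeds the threshold,
    pulling its weights toward the decoded input pattern."""
    acts = node_activations(weights, ranks)
    winner = int(np.argmax(acts))
    if acts[winner] > threshold:
        target = MOD ** ranks                # decoded input pattern
        weights[winner] += lr * (target - weights[winner])
    return winner, float(acts[winner])


# Usage with a random spectrum standing in for one DFT frame.
rng = np.random.default_rng(1)
spectrum = np.abs(rng.standard_normal(129))
weights = rng.random((3, 129))               # three nodes, one per target vowel
winner, activation = ssom_step(weights, rank_order_code(spectrum))
print("winner:", winner, "activation:", activation)
```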
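
Finally, the correlation-envelope speech detector can be sketched as below: the Pearson correlation is computed between neighbouring DFT magnitude frames and summed over a sliding window to form the envelope. Frame length, hop, window size and the activity threshold are placeholder values, and the exact windowing of the thesis is not reproduced.

```python
# Sketch of the correlation-envelope speech detector: Pearson correlation
# between neighbouring DFT magnitude frames, summed over a sliding window.
import numpy as np


def dft_frames(signal, frame_len=256, hop=128):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames) * np.hanning(frame_len), axis=1))


def correlation_envelope(spectra, win=10):
    """Correlation of each frame with its neighbour, then a sliding-window sum."""
    corrs = np.array([np.corrcoef(spectra[i], spectra[i + 1])[0, 1]
                      for i in range(len(spectra) - 1)])
    return np.convolve(corrs, np.ones(win), mode="same")


# Because the Pearson coefficient is scale-invariant, the envelope responds to
# the spectral similarity of neighbouring frames rather than to their energy,
# which is what makes the method insensitive to changes in volume level.
signal = np.random.randn(16000)              # placeholder waveform (2 s at 8 kHz)
env = correlation_envelope(dft_frames(signal))
speech_frames = env > 0.5 * env.max()        # placeholder decision threshold
print(speech_frames.sum(), "of", len(env), "frames flagged as speech-like")
```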