Summary: This thesis investigates speaker-specific models trained on sets containing varying numbers of repetitions per text, focusing mainly on models trained with only a few (fewer than three) repetitions. The aim is to assess the performance of a speaker model as the amount of training data increases while the length of the test utterances is kept fixed. This theme is chosen because small data sets are problematic for training speech and speaker recognition models, and small training sets regularly arise when building speaker-specific models, since it is often difficult to collect a large amount of speaker-specific data. In the first part of this work, three speaker recognition approaches are assessed: vector quantisation (VQ), dynamic time warping (DTW) and continuous density hidden Markov models (CDHMMs). These experiments use increasing training set sizes, containing from 1 to 10 repetitions of each text, to train each speaker model. The intent is to show which approach is most appropriate across the range of available training set sizes, for both text-dependent and text-independent speaker recognition. This part concludes by suggesting that the text-dependent DTW approach is the best of the chosen configurations. The second part of the work concerns adaptation using text-dependent CDHMMs. A new adaptation approach called cumulative likelihood estimation (CLE) is introduced and compared with the maximum a posteriori (MAP) approach and other benchmark results. The framework is chosen such that only single repetitions of each utterance are available for enrolment and subsequent adaptation of the speaker model. The objective is to assess whether creating speaker models through adaptation is a viable alternative to creating them from stored speaker-specific speech. It is concluded that both MAP and CLE are viable alternatives; in particular, CLE can build a model by adapting on single repetitions of data and achieve performance as good as or better than that of an equivalent model, such as DTW, trained on an equivalent amount of stored data.