Speech Intelligibility Measurement on the basis of ITU-T Recommendation P.863

Objective speech intelligibility measurement techniques like AI (Articulation Index) and AI based STI (Speech Transmission Index) fail to assess speech intelligibility in modern telecommunication networks that use several non-linear processing for enhancing speech. Moreover, these techniques do not...

Full description

Bibliographic Details
Main Author: GHIMIRE, SWATANTRA
Format: Others
Language:English
Published: Högskolan i Halmstad, Laboratoriet för intelligenta system 2012
Subjects:
Online Access:http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-20023
Description
Summary:Objective speech intelligibility measurement techniques like AI (Articulation Index) and AI based STI (Speech Transmission Index) fail to assess speech intelligibility in modern telecommunication networks that use several non-linear processing for enhancing speech. Moreover, these techniques do not allow prediction of single individual CVC (Consonant Vowel Consonant) word intelligibility scores. ITU-T P.863 standard [1], which was developed for assessing speech quality, is used as a starting point to develop a simple new model for predicting subjective speech intelligibility of individual CVC words. Subjective intelligibility measurements were carried out for a large set of speech degradations. The subjective test uses single CVC word presentations in an eight alternative closed response set experiment. Subjects assess individual degraded CVC words and an average of correct recognition is used as the intelligibility score for a particular CVC word. The first subjective database uses CVC words that have variations in the first consonant i.e. /C/ous (represented as "kæʊs" using International Phonetic Association phonetic alphabets). This database is used for developing the objective model, while a new database based on VC words (Vowel Consonant) that uses variations in the second consonant (a/C/ e.g. aH, aL) is used for validating the model. ITU-T P.863 shows very poor results with a correlation of 0.30 for the first subjective database. A first extension to make P.863 suited for intelligibility prediction is done by restructuring speech material to meet the temporal structure requirements (speech+silence+speech) set for standard P.863 measurements. The restructuring is done by concatenating every original and degraded CVC word with itself. There is no significant improvement in correlation (0.34) when using P.863 on the restructured first subjective database (speech material meets temporal requirements).  In this thesis a simple model based on P.863 is developed for assessing intelligibility of individual CVC words. The model uses a linear combination of a simple time clipping indicator (missing speech parts) and a “Good frame count” indicator which is based on the local perceptual (frame by frame) signal to noise ratio. Using this model on the restructured first database, a reasonably good correlation of 0.81 is seen between subjective scores and the model output values. For the validation database, a correlation of around 0.76 is obtained. Further validation on an existing database at TNO, which uses time clipping degradation only, shows an excellent correlation of 0.98. Although a reasonably good correlation is seen on the first database and the validation database, it is too low for reliable measurements. Further validation and development is required, nevertheless the results show that a perception-based technique that uses internal representations of signals can be used for predicting subjective intelligibility scores of individual CVC words.