Reconstruction of intelligible audio speech from visual speech information

The aim of the work conducted in this thesis is to reconstruct audio speech signals using information extracted solely from a visual stream of a speaker's face, with applications in surveillance scenarios and silent speech interfaces. Visual speech is limited to what can be seen of the mouth, lips, teeth, and tongue, and these visible articulators convey considerably less information than is available in the audio domain, which makes the task difficult. Accordingly, the emphasis is on reconstructing intelligible speech, with less regard given to quality.

A speech production model is used to reconstruct the audio speech, and methods are presented for generating or estimating the parameters the model requires. Three approaches are explored for producing spectral-envelope estimates from visual features, as this parameter makes the greatest contribution to speech intelligibility. The first approach uses regression to perform the visual-to-audio mapping; two further approaches use vector quantisation techniques and classification models, with long-range temporal information incorporated at the feature level and the model level.
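
The clustering-and-classification route can be illustrated with a minimal sketch. This is not the thesis's implementation: the codebook size, the stacking context, the feature dimensions, and the random-forest classifier are all illustrative assumptions. Audio spectral-envelope frames are vector-quantised with k-means, a classifier maps stacked visual frames (a simple form of feature-level temporal encoding) to codebook indices, and reconstruction looks up the predicted centroids.

```python
# Sketch of a clustering-and-classification visual-to-audio mapping
# (illustrative: codebook size, context width, and classifier are assumptions).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def stack_frames(feats, context=5):
    """Feature-level temporal encoding: concatenate +/- context frames."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

# visual_train: (n_frames, n_visual_dims), e.g. features of the mouth region;
# audio_train:  (n_frames, n_envelope_dims), time-aligned with the visual frames.
def train(visual_train, audio_train, n_codes=64):
    codebook = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(audio_train)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(stack_frames(visual_train), codebook.labels_)
    return codebook, clf

def reconstruct_envelopes(visual_test, codebook, clf):
    idx = clf.predict(stack_frames(visual_test))
    return codebook.cluster_centers_[idx]  # (n_frames, n_envelope_dims)
```

At synthesis time the predicted centroid sequence stands in for the spectral envelope; in practice some temporal smoothing across frames would typically precede synthesis.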

Excitation information, namely fundamental frequency and aperiodicity, is generated using artificial methods and joint-feature clustering approaches.
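
The simplest artificial variant can be sketched as a constant-F0 pulse train in voiced frames and white noise elsewhere. This is a minimal sketch under assumed parameter values, not the scheme evaluated in the thesis:

```python
# Minimal artificial-excitation sketch: constant-F0 pulse train in voiced
# frames, white noise in unvoiced ones. All parameter values are illustrative.
import numpy as np

def make_excitation(voiced_flags, fs=16000, hop=160, f0=120.0):
    """Build one excitation sample stream from per-frame voicing decisions."""
    rng = np.random.default_rng(0)
    period = int(fs / f0)                 # samples between glottal pulses
    exc = np.zeros(len(voiced_flags) * hop)
    for t in range(len(exc)):
        if voiced_flags[t // hop]:
            if t % period == 0:
                exc[t] = 1.0              # periodic impulse
        else:
            exc[t] = 0.1 * rng.standard_normal()  # aperiodic component
    return exc
```

Passing an excitation stream like this through a filter fitted to the estimated spectral envelopes completes a basic source-filter synthesis.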

Evaluations are first performed using mean squared error analyses and objective measures of speech intelligibility to refine the various system configurations; subjective listening tests are then conducted to determine the word-level accuracy of the reconstructed speech, giving real intelligibility scores. The best-performing visual-to-audio mapping approach, a clustering-and-classification framework with feature-level temporal encoding, achieves audio-only intelligibility scores of 77% and audiovisual intelligibility scores of 84% on the GRID dataset. The methods are also applied to a larger and more continuous dataset, with less favourable results, although extensions to the work presented are expected to yield a further increase in intelligibility.
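
The objective refinement step can be sketched as follows, assuming log spectral-envelope features and the third-party pystoi implementation of STOI; the abstract does not name the specific objective intelligibility measure used, so this choice is an assumption:

```python
# Sketch of objective refinement: mean squared error over log spectral
# envelopes plus STOI as an intelligibility proxy (assumes the pystoi package).
import numpy as np
from pystoi import stoi

def envelope_mse(est_env, ref_env):
    """Frame-averaged squared error between log spectral envelopes."""
    return float(np.mean((est_env - ref_env) ** 2))

def intelligibility_proxy(ref_audio, est_audio, fs=16000):
    """Short-Time Objective Intelligibility of reconstructed vs. reference."""
    return stoi(ref_audio, est_audio, fs, extended=False)
```
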
Bibliographic Details
Main Author: Le Cornu, Thomas
Published: University of East Anglia, 2016
Subjects: 004
Format: Electronic Thesis or Dissertation
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.743290
Online Access: https://ueaeprints.uea.ac.uk/67012/