Alternative Visual Units for an Optimized Phoneme-Based Lipreading System

Lipreading is understanding speech from observed lip movements. An observed series of lip motions is an ordered sequence of visual lip gestures. These gestures are commonly known as ‘visemes’, although they are not yet formally defined. In this article, we describe a structured approach which allows us to create speaker-dependent visemes with a fixed number of visemes within each set. We create sets of visemes of sizes 2 to 45. Each set of visemes is based upon clustering phonemes, so each set has a unique phoneme-to-viseme mapping. We first present an experiment using these maps and the Resource Management Audio-Visual (RMAV) dataset which shows the effect of changing the viseme map size in speaker-dependent machine lipreading, and we demonstrate that word recognition with phoneme classifiers is possible. Furthermore, we show that there are intermediate units between visemes and phonemes which are better still. Second, we present a novel two-pass training scheme for phoneme classifiers: in the first pass, the intermediate visual units from our first experiment serve as classifiers; in the second pass, we use the phoneme-to-viseme maps to retrain these into phoneme classifiers. This method significantly improves on previous lipreading results with RMAV speakers.
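
The abstract hinges on one technique: grouping phonemes into a fixed number of visual classes to obtain a phoneme-to-viseme map. The sketch below shows one plausible way such a map could be derived by clustering a phoneme confusion matrix; the toy confusion counts, the `phoneme_to_viseme_map` helper, and the average-linkage clustering choice are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (assumptions noted above): derive an N-class
# phoneme-to-viseme map by clustering a phoneme confusion matrix,
# treating mutually confusable phonemes as visually similar.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def phoneme_to_viseme_map(confusion, phonemes, n_visemes):
    """Group phonemes into n_visemes classes from a confusion matrix.

    confusion[i][j] counts how often phoneme i was recognized as j.
    """
    # Symmetrize and normalize confusion counts into similarities in [0, 1].
    sim = (confusion + confusion.T).astype(float)
    sim /= sim.max()
    # Turn similarity into distance; zero the diagonal for squareform.
    dist = 1.0 - sim
    np.fill_diagonal(dist, 0.0)
    # Agglomerative clustering, cut to exactly n_visemes clusters.
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_visemes, criterion="maxclust")
    return {p: f"V{l:02d}" for p, l in zip(phonemes, labels)}

# Toy example: /p/ and /b/ look alike on the lips, as do /f/ and /v/,
# so a 2-viseme map should pair them up.
phonemes = ["p", "b", "f", "v"]
confusion = np.array([[9, 8, 1, 0],
                      [8, 9, 0, 1],
                      [1, 0, 9, 7],
                      [0, 1, 7, 9]])
print(phoneme_to_viseme_map(confusion, phonemes, n_visemes=2))
# -> {'p': 'V01', 'b': 'V01', 'f': 'V02', 'v': 'V02'}
```

Varying `n_visemes` from 2 up to the full phoneme count reproduces the paper's sweep over map sizes, which is what exposes the intermediate units between visemes and phonemes.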

Bibliographic Details
Main Authors: Helen L. Bear, Richard Harvey
Format: Article
Language: English
Published: MDPI AG, 2019-09-01
Series: Applied Sciences
Subjects: visual speech, lipreading, recognition, audio-visual, speech, classification, viseme, phoneme, transfer learning
Online Access: https://www.mdpi.com/2076-3417/9/18/3870
DOI: 10.3390/app9183870
ISSN: 2076-3417
Citation: Applied Sciences, Vol. 9, No. 18, Article 3870 (2019)
Affiliations: Helen L. Bear (School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK); Richard Harvey (School of Computing Sciences, University of East Anglia, Norwich NR4 7TJ, UK)