On the Impact of Children's Emotional Speech on Acoustic and Language Models

The automatic recognition of children's speech is well known to be a challenge, and so is the influence of affect that is believed to downgrade performance of a speech recogniser. In this contribution, we investigate the combination of both phenomena. Extensive test runs are carried out for 1 k vocabulary continuous speech recognition on spontaneous motherese, emphatic, and angry children's speech as opposed to neutral speech. The experiments address the question how specific emotions influence word accuracy. In a first scenario, "emotional" speech recognisers are compared to a speech recogniser trained on neutral speech only. For this comparison, equal amounts of training data are used for each emotion-related state. In a second scenario, a "neutral" speech recogniser trained on large amounts of neutral speech is adapted by adding only some emotionally coloured data in the training process. The results show that emphatic and angry speech is recognised best—even better than neutral speech—and that the performance can be improved further by adaptation of the acoustic and linguistic models. In order to show the variability of emotional speech, we visualise the distribution of the four emotion-related states in the MFCC space by applying a Sammon transformation.
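
The abstract's last sentence mentions visualising the distribution of the four emotion-related states in MFCC space with a Sammon transformation. As a purely illustrative sketch (none of this code is from the paper), the snippet below implements a hand-rolled Sammon mapping by gradient descent on Sammon's stress and applies it to MFCC frames; it assumes NumPy and librosa, uses a bundled librosa example clip in place of the children's speech corpus, and the helper name sammon_map is invented for this example.

# Illustrative sketch only -- not the authors' code or data.
import numpy as np
import librosa

def sammon_map(X, n_iter=500, lr=0.01, eps=1e-9):
    """Project the rows of X into 2-D by gradient descent on Sammon's stress."""
    # Pairwise Euclidean distances in the original (MFCC) feature space.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)) + eps
    # Initialise the 2-D layout with the first two principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:2].T
    for _ in range(n_iter):
        d = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)) + eps
        # Gradient of sum_{i,j} (D_ij - d_ij)^2 / D_ij with respect to the 2-D points.
        ratio = (D - d) / (D * d)
        np.fill_diagonal(ratio, 0.0)
        grad = -2.0 * (ratio[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y

# Toy usage: MFCC frames from a bundled librosa clip, subsampled and projected to 2-D.
y, sr = librosa.load(librosa.example("trumpet"))
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # shape: (frames, 13)
points_2d = sammon_map(mfcc[::10])
print(points_2d.shape)

In the paper such a projection serves only for visualisation: each point would be labelled by its emotion-related state (motherese, emphatic, angry, neutral) to show the variability of emotional speech in the acoustic feature space.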

Bibliographic Details
Main Authors: Björn Schuller, Dino Seppi, Stefan Steidl, Anton Batliner
Format: Article
Language: English
Published: SpringerOpen 2010-01-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
ISSN: 1687-4714, 1687-4722
Online Access: http://dx.doi.org/10.1155/2010/783954
Source: DOAJ