Advanced natural language processing for improved prosody in text-to-speech synthesis / G. I. Schlünz

Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices do not sound very natural, however, lacking in the areas of emphasis and emot...

Bibliographic Details
Main Author:	Schlünz, Georg Isaac
Language:	en
Published:	North-West University 2014
Subjects:	Natural language processing; Text-to-speech synthesis; Prosody; Discourse; Information structure; Affect; OCC model; E-motif
Online Access:	http://hdl.handle.net/10394/10634

id	ndltd-NWUBOLOKA1-oai-dspace.nwu.ac.za-10394-10634
record_format	oai_dc
collection	NDLTD
language	en
sources	NDLTD
topic	Natural language processing; Text-to-speech synthesis; Prosody; Discourse; Information structure; Affect; OCC model; E-motif
description	Text-to-speech synthesis enables the speech-impeded user of an augmentative and alternative communication system to partake in any conversation on any topic, because it can produce dynamic content. Current synthetic voices do not sound very natural, however, lacking in the areas of emphasis and emotion. These qualities are furthermore important to convey meaning and intent beyond that which can be achieved by the vocabulary of words only. Put differently, speech synthesis requires a more comprehensive analysis of its text input beyond the word level to infer the meaning and intent that elicit emphasis and emotion. The synthesised speech then needs to imitate the effects that these textual factors have on the acoustics of human speech. This research addresses these challenges by commencing with a literature study on the state of the art in the fields of natural language processing, text-to-speech synthesis and speech prosody. It is noted that the higher linguistic levels of discourse, information structure and affect are necessary for the text analysis to shape the prosody appropriately for more natural synthesised speech. Discourse and information structure account for meaning, intent and emphasis, and affect formalises the modelling of emotion. The OCC model is shown to be a suitable point of departure for a new model of affect that can leverage the higher linguistic levels. The audiobook is presented as a text and speech resource for the modelling of discourse, information structure and affect because its narrative structure is prosodically richer than the random constitution of a traditional text-to-speech corpus. A set of audiobooks is selected and phonetically aligned for subsequent investigation. The new model of discourse, information structure and affect, called e-motif, is developed to take advantage of the audiobook text. It is a subjective model that does not specify any particular belief system in order to appraise its emotions, but defines only anonymous affect states. Its cognitive and social features rely heavily on the coreference resolution of the text, but this process is found not to be accurate enough to produce usable feature values. The research concludes with an experimental investigation of the influence of the e-motif features on human speech and synthesised speech. The aligned audiobook speech is inspected for prosodic correlates of the cognitive and social features, revealing that some activity occurs in the intonational domain. However, when the aligned audiobook speech is used in the training of a synthetic voice, the e-motif effects are overshadowed by those of structural features that come standard in the voice building framework. PhD (Information Technology), North-West University, Vaal Triangle Campus, 2014
author	Schlünz, Georg Isaac
title	Advanced natural language processing for improved prosody in text-to-speech synthesis / G. I. Schlünz
title_sort	advanced natural language processing for improved prosody in text-to-speech synthesis / g. i. schlünz
publisher	North-West University
publishDate	2014
url	http://hdl.handle.net/10394/10634
work_keys_str_mv	AT schlunzgeorgisaac advancednaturallanguageprocessingforimprovedprosodyintexttospeechsynthesisgischlunz
_version_	1716715520008388608

Similar Items