Context Recognition Methods using Audio Signals for Human-Machine Interaction

abstract: Audio signals, such as speech and ambient sounds, convey rich information pertaining to a user’s activity, mood or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This is challenging due to its subjective natur...

Bibliographic Details
Other Authors:	Shah, Mohit (Author)
Format:	Doctoral Thesis
Language:	English
Published:	2015
Subjects:	Electrical engineering; Computer science; articulation; emotion recognition; lifelogging; speech analysis
Online Access:	http://hdl.handle.net/2286/R.I.29752

id	ndltd-asu.edu-item-29752
record_format	oai_dc
spelling	ndltd-asu.edu-item-29752 2018-06-22T03:06:01Z Context Recognition Methods using Audio Signals for Human-Machine Interaction Dissertation/Thesis Shah, Mohit (Author) Spanias, Andreas (Advisor) Chakrabarti, Chaitali (Advisor) Berisha, Visar (Committee member) Turaga, Pavan (Committee member) Arizona State University (Publisher) Electrical engineering Computer science articulation emotion recognition lifelogging speech analysis eng 162 pages Doctoral Dissertation Electrical Engineering 2015 Doctoral Dissertation http://hdl.handle.net/2286/R.I.29752 http://rightsstatements.org/vocab/InC/1.0/ All Rights Reserved 2015
collection	NDLTD
language	English
format	Doctoral Thesis
sources	NDLTD
topic	Electrical engineering; Computer science; articulation; emotion recognition; lifelogging; speech analysis
spellingShingle	Electrical engineering Computer science articulation emotion recognition lifelogging speech analysis Context Recognition Methods using Audio Signals for Human-Machine Interaction
description	abstract: Audio signals, such as speech and ambient sounds, convey rich information pertaining to a user’s activity, mood or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This task is challenging because of its subjective nature and hence requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions, for speech-based applications involving emotion recognition and keyword detection and for ambient-sound-based applications such as lifelogging. The expression and perception of emotions vary across speakers and cultures; features and classification methods that generalize well to different conditions are therefore strongly desired. A method based on latent topic models is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches over multiple databases. Cross-corpus studies are conducted to determine the ability of these features to generalize well across different databases. The proposed method is also applied to derive features from facial expressions; a multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves the recognition performance. Besides affecting the acoustic properties of speech, emotions have a strong influence over speech articulation kinematics. A learning approach that constrains a classifier trained over acoustic descriptors to also model articulatory data is proposed. This method requires articulatory information only during the training stage, thus overcoming the challenges inherent to large-scale data collection, while simultaneously exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems. 
Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. The framework is evaluated on real-world data and accompanied by a prototypical Android-based user interface. The proposed methods are also assessed in terms of computational and implementation complexity. Software and field-programmable gate array (FPGA) based implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time capabilities and low power consumption. === Dissertation/Thesis === Doctoral Dissertation Electrical Engineering 2015
author2	Shah, Mohit (Author)
author_facet	Shah, Mohit (Author)
title	Context Recognition Methods using Audio Signals for Human-Machine Interaction
title_sort	context recognition methods using audio signals for human-machine interaction
publishDate	2015
url	http://hdl.handle.net/2286/R.I.29752
_version_	1718700711324155904

Similar Items