Context Recognition Methods using Audio Signals for Human-Machine Interaction

Abstract: Audio signals, such as speech and ambient sounds, convey rich information about a user's activity, mood, and intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. The task is challenging because such information is subjective, and it therefore requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions, covering speech-based applications such as emotion recognition and keyword detection as well as ambient-sound applications such as lifelogging.

The expression and perception of emotions vary across speakers and cultures, so features and classification methods that generalize to different conditions are strongly desired. A method based on latent topic models is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches on multiple databases, and cross-corpus studies are conducted to determine how well they generalize across databases. The method is also applied to derive features from facial expressions; multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves recognition performance.
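
As a rough illustration of the feature-learning idea above, the sketch below derives supra-segmental features by quantizing frame-level descriptors into "audio words" and fitting a latent topic model over the resulting per-utterance histograms. The choice of MFCCs, the codebook and topic sizes, scikit-learn's LDA, and the function names are illustrative assumptions, not the dissertation's exact configuration.

```python
# Minimal sketch: supra-segmental features via a latent topic model over
# quantized low-level acoustic descriptors (bag-of-audio-words + LDA).
# Hyperparameters (13 MFCCs, 256 audio words, 32 topics) are illustrative,
# not the settings used in the dissertation.
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def frame_descriptors(path, n_mfcc=13):
    """Low-level descriptors: one MFCC vector per analysis frame."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

def topic_features(utterance_paths, n_words=256, n_topics=32):
    descs = [frame_descriptors(p) for p in utterance_paths]
    # Quantize all frames into a shared "audio word" codebook.
    codebook = KMeans(n_clusters=n_words, n_init=4).fit(np.vstack(descs))
    # One bag-of-audio-words histogram (a "document") per utterance.
    counts = np.stack([
        np.bincount(codebook.predict(d), minlength=n_words) for d in descs
    ])
    # Per-utterance topic posteriors act as fixed-length supra-segmental features.
    lda = LatentDirichletAllocation(n_components=n_topics)
    return lda.fit_transform(counts)  # (utterances, n_topics)
```

The topic posteriors are fixed-length per utterance and can feed any standard classifier. The same recipe could in principle be run over facial-expression descriptors, with the two modalities fused, for example, by averaging classifier posteriors.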

Besides affecting the acoustic properties of speech, emotions have a strong influence on speech articulation kinematics. A learning approach is proposed that constrains a classifier trained on acoustic descriptors to also model articulatory data. The method requires articulatory information only during the training stage, which avoids the challenges inherent in large-scale articulatory data collection while still exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.
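
One common way to realize such a training-only constraint is multi-task learning: a shared acoustic encoder with an auxiliary articulatory-regression head that is discarded at test time. The sketch below follows that pattern as a stand-in; the network shape, the loss weighting, and the multi-task formulation itself are assumptions for illustration rather than the dissertation's exact method.

```python
# Minimal sketch of a training-time articulatory constraint: a shared
# acoustic encoder is trained jointly on emotion labels and on regressing
# articulatory trajectories. The articulatory head is unused at test time,
# so deployment needs audio features only. Sizes and the loss weight
# alpha are illustrative assumptions.
import torch
import torch.nn as nn

class ArticulatoryConstrainedNet(nn.Module):
    def __init__(self, n_acoustic, n_articulatory, n_emotions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_acoustic, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.emotion_head = nn.Linear(hidden, n_emotions)
        self.artic_head = nn.Linear(hidden, n_articulatory)  # training only

    def forward(self, x):
        h = self.encoder(x)
        return self.emotion_head(h), self.artic_head(h)

def training_step(model, x, emotion, artic, opt, alpha=0.5):
    """Joint loss: emotion classification plus articulatory regression."""
    opt.zero_grad()
    logits, artic_pred = model(x)
    loss = nn.functional.cross_entropy(logits, emotion) \
         + alpha * nn.functional.mse_loss(artic_pred, artic)
    loss.backward()
    opt.step()
    return loss.item()

# At test time only the acoustic input is needed:
#   logits, _ = model(acoustic_features)
```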

Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation, and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. Its performance is evaluated on real-world data, and the framework is accompanied by a prototypical Android-based user interface.

The proposed methods are also assessed in terms of computational and implementation complexity. Software and field-programmable gate array (FPGA) implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time operation and low power consumption.
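
Returning to the lifelogging front end described above, a minimal sketch of the segmentation stage might look like the following: summarize a long recording with per-window descriptors, then place segment boundaries where adjacent windows differ sharply. The window length, novelty threshold, and function names are illustrative assumptions; the dissertation's framework (including annotation and the Android interface) goes well beyond this.

```python
# Minimal sketch: boundary detection over a long recording by thresholding
# the change between consecutive windowed MFCC summaries. Parameters are
# illustrative, not the dissertation's settings.
import numpy as np
import librosa

def segment_boundaries(path, win_s=2.0, thresh=2.0):
    y, sr = librosa.load(path, sr=16000)
    hop = int(win_s * sr)
    # One mean-MFCC summary vector per non-overlapping window.
    feats = np.stack([
        librosa.feature.mfcc(y=y[i:i + hop], sr=sr, n_mfcc=13).mean(axis=1)
        for i in range(0, len(y) - hop, hop)
    ])
    # Distance between consecutive windows, z-scored; peaks mark changes.
    d = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    z = (d - d.mean()) / (d.std() + 1e-9)
    return [(i + 1) * win_s for i in np.where(z > thresh)[0]]  # seconds
```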

Bibliographic Details
Main Author: Shah, Mohit
Other Authors: Spanias, Andreas (Advisor); Chakrabarti, Chaitali (Advisor); Berisha, Visar (Committee member); Turaga, Pavan (Committee member)
Format: Doctoral Thesis (Doctoral Dissertation, Electrical Engineering; 162 pages)
Language: English
Published: Arizona State University, 2015
Subjects: Electrical engineering; Computer science; articulation; emotion recognition; lifelogging; speech analysis
Online Access: http://hdl.handle.net/2286/R.I.29752
Rights: All Rights Reserved (http://rightsstatements.org/vocab/InC/1.0/)