Unsupervised analysis of behaviour dynamics


Bibliographic Details
Main Author: Zafeiriou, Lazaros
Other Authors: Pantic, Maja
Published: Imperial College London 2016
Subjects:
Online Access:http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.712877
id ndltd-bl.uk-oai-ethos.bl.uk-712877
record_format oai_dc
collection NDLTD
sources NDLTD
topic 006.3
spellingShingle 006.3
Zafeiriou, Lazaros
Unsupervised analysis of behaviour dynamics
description Human facial behaviour analysis is an important task in developing automatic Human-Computer Interaction systems, and has received rapidly increasing attention over the past two decades. The dynamics of facial behaviour convey important information (e.g., discriminating posed from spontaneous expressions) yet remain, to date, a largely unexploited field. This thesis presents machine learning algorithms that address the relatively unexplored problem of extracting features that efficiently and effectively capture the temporal dynamics of behaviour and can therefore also be used for temporal alignment. The proposed methods are all unsupervised, i.e. they do not exploit any label information. The motivation for developing unsupervised algorithms is that labelled/annotated data are hard to obtain, since annotating behaviour dynamics is a time-demanding, expensive and labour-intensive procedure. Additionally, these models incorporate temporal alignment, enabling a joint temporal decomposition of two or more time series into a common expression manifold, using either low-dimensional sets of landmarks or raw pixel intensities. This is a challenging problem for many scientific disciplines in which observation samples need to be aligned in time. It is particularly significant for facial expressions, where the activation of facial muscles (Action Units) typically follows a set of predefined temporal phases. The methods we propose for capturing the dynamics of facial expressions use Component Analysis (CA), a fundamental step in most computer vision applications, especially for reducing the usually high-dimensional input data in a meaningful manner while preserving a certain objective. These CA methodologies can be divided into deterministic and probabilistic techniques.
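As a concrete illustration of deterministic CA used for dimensionality reduction (a generic sketch, not code from the thesis), a minimal PCA in NumPy, with random toy data standing in for landmark features:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 frames of 6-D "landmark" features with only 2 underlying factors.
X = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 6))
X -= X.mean(axis=0)                      # centre the data

# Eigendecomposition of the covariance gives the principal components.
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]              # top-2 components
Z = X @ W                                # low-dimensional projection

# The top-2 components capture essentially all variance of this rank-2 data.
explained = eigvals[::-1][:2].sum() / eigvals.sum()
print(Z.shape, round(explained, 3))
```

Probabilistic CA methods such as EM-SFA replace this deterministic projection with a latent-variable model, which is what allows noise and priors to be handled explicitly.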
In deterministic CA, noise cannot be modelled and the methods do not incorporate prior information. Probabilistic CA, on the other hand, is a powerful framework that naturally allows the incorporation of noise and a-priori knowledge into the developed models. A significant contribution of our work is an Expectation Maximization (EM) algorithm for performing inference in a probabilistic formulation of Slow Feature Analysis (SFA), extended to handle more than one time-varying data sequence. Moreover, we demonstrate that the probabilistic SFA (EM-SFA) algorithm, which discovers the common slowest-varying latent space of multiple sequences, can be combined with Dynamic Time Warping (DTW) techniques for robust sequence time-alignment. Most unsupervised learning techniques, such as Principal Component Analysis (PCA), enforce only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability. This yields a holistic representation in which the latent features are difficult to interpret. To alleviate this, a group of unsupervised learning algorithms known as Non-negative Matrix Factorization (NMF) has been proposed. These algorithms enforce non-negativity constraints, resulting in a parts-based representation, since they allow only additive and not subtractive combinations. Another major contribution of this thesis is a model that combines the properties of temporal slowness and non-negative parts-based learning into a common framework that learns slowly varying parts-based representations of time-varying sequences. The proposed representations can be used to capture the underlying dynamics of temporal phenomena such as facial behaviour. Furthermore, we extend the above framework to align two visual sequences that display the same dynamic phenomenon by proposing a novel joint NMF.
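To illustrate the purely additive, parts-based nature of NMF described above, a minimal sketch of the standard Lee-Seung multiplicative updates for the Frobenius objective (an assumption for illustration; the thesis proposes its own slowness-constrained and joint NMF variants):

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy non-negative data matrix: 30 "pixels" x 100 "frames".
V = rng.random((30, 100))
r = 5                                     # number of parts
W = rng.random((30, r)) + 1e-3            # parts (basis images)
H = rng.random((r, 100)) + 1e-3           # per-frame activations

eps = 1e-9
err = [np.linalg.norm(V - W @ H)]
for _ in range(200):
    # Lee & Seung multiplicative updates for the Frobenius objective;
    # they preserve non-negativity, so the reconstruction is purely additive.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    err.append(np.linalg.norm(V - W @ H))

print(err[-1] <= err[0])                  # objective is non-increasing
print(W.min() >= 0 and H.min() >= 0)      # no cancellations anywhere
```

Because no entry of W or H can go negative, no component can cancel another, which is exactly why the learned factors tend to correspond to localized parts rather than holistic eigen-faces.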
The proposed framework enables a joint temporal decomposition of two non-negative time series into a non-negative shared latent space, where they can be temporally aligned. The method is tailored to the temporal alignment of facial events, since it discovers the facial parts that are jointly activated in the sequences, along with their temporal activation envelopes. We demonstrate the power of the proposed decompositions in unsupervised analysis of dynamic visual phenomena, as well as in temporal alignment of facial behaviour. The predominant strategy for facial expression analysis and temporal analysis of facial events is the following: a generic facial landmark tracker, usually trained on thousands of carefully annotated examples, is applied to track the landmark points, and analysis is then performed using mostly the shape and, more rarely, the facial texture. In this thesis, we challenge this framework by showing that it is feasible to perform joint landmark localization and temporal analysis of a behavioural sequence using only a simple face detector and a simple shape model. To this end, we formulate a generative model that jointly describes the data and captures temporal dependencies by incorporating an autoregressive chain in the latent space. We also extend this model by integrating a temporal alignment process to align two unsynchronized sequences of observations displaying highly deformable, texture-varying objects. The resulting model is the first to perform simultaneous spatial and temporal alignment, showing that by treating deformable spatial and temporal alignment jointly we achieve better results than treating them independently.
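The temporal-alignment step that recurs throughout the abstract builds on DTW. A textbook dynamic-programming implementation (a generic sketch, not the thesis's robust variants), aligning one "activation envelope" with a half-speed copy of itself:

```python
import numpy as np

def dtw(x, y):
    """Classic dynamic-programming DTW between two 1-D sequences.
    Returns the cumulative alignment cost and the warping path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path from (n, m) to (1, 1).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

# The same bell-shaped "activation envelope", once at normal and once at half speed.
fast = np.sin(np.pi * np.linspace(0, 1, 20))
slow = np.sin(np.pi * np.linspace(0, 1, 40))
cost, path = dtw(fast, slow)
print(round(cost, 4), path[0], path[-1])
```

In the thesis's setting, the per-frame distance is computed between latent features (e.g. the shared slow or NMF activations) rather than raw scalars, which is what makes the alignment robust to appearance differences between the two sequences.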
author2 Pantic, Maja
author_facet Pantic, Maja
Zafeiriou, Lazaros
author Zafeiriou, Lazaros
author_sort Zafeiriou, Lazaros
title Unsupervised analysis of behaviour dynamics
title_short Unsupervised analysis of behaviour dynamics
title_full Unsupervised analysis of behaviour dynamics
title_fullStr Unsupervised analysis of behaviour dynamics
title_full_unstemmed Unsupervised analysis of behaviour dynamics
title_sort unsupervised analysis of behaviour dynamics
publisher Imperial College London
publishDate 2016
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.712877
work_keys_str_mv AT zafeirioulazaros unsupervisedanalysisofbehaviourdynamics
_version_ 1718725955939205120
url http://hdl.handle.net/10044/1/45546
format Electronic Thesis or Dissertation