From pixels to gestures: learning visual representations for human analysis in color and depth data sequences

The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others. In this dissertation we...

Bibliographic Details
Main Author:	Antonio Hernandez-Vela
Format:	Article
Language:	English
Published:	Computer Vision Center Press 2015-12-01
Series:	ELCVIA Electronic Letters on Computer Vision and Image Analysis
Subjects:	Computer Vision; Pattern recognition; Statistical Pattern Recognition; Classification and Clustering; Separation and Segmentation; Face and Gesture
Online Access:	https://elcvia.cvc.uab.es/article/view/723

id	doaj-506fad1af5114c07b9c4267e213e72ce
record_format	Article
spelling	doaj-506fad1af5114c07b9c4267e213e72ce; 2021-09-18T12:38:57Z; eng; Computer Vision Center Press; ELCVIA Electronic Letters on Computer Vision and Image Analysis; 1577-5097; 2015-12-01; 14; 3; 10.5565/rev/elcvia.723; 270; From pixels to gestures: learning visual representations for human analysis in color and depth data sequences; Antonio Hernandez-Vela; 0; Universitat de Barcelona & Computer Vision Center; https://elcvia.cvc.uab.es/article/view/723; Computer Vision; Pattern recognition; Statistical Pattern Recognition; Classification and Clustering; Separation and Segmentation; Face and Gesture
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Antonio Hernandez-Vela
title	From pixels to gestures: learning visual representations for human analysis in color and depth data sequences
title_sort	from pixels to gestures: learning visual representations for human analysis in color and depth data sequences
publisher	Computer Vision Center Press
series	ELCVIA Electronic Letters on Computer Vision and Image Analysis
issn	1577-5097
publishDate	2015-12-01
description	The visual analysis of humans from images is an important topic of interest due to its relevance to many computer vision applications like pedestrian detection, monitoring and surveillance, human-computer interaction, e-health or content-based image retrieval, among others. In this dissertation we are interested in learning different visual representations of the human body that are helpful for the visual analysis of humans in images and video sequences. To that end, we analyze both RGB and depth image modalities and address the problem from three different research lines, at different levels of abstraction; from pixels to gestures: human segmentation, human pose estimation and gesture recognition. First, we show how binary segmentation (object vs. background) of the human body in image sequences is helpful to remove all the background clutter present in the scene. The presented method, based on Graph cuts optimization, enforces spatio-temporal consistency of the produced segmentation masks among consecutive frames. Secondly, we present a framework for multi-label segmentation for obtaining much more detailed segmentation masks: instead of just obtaining a binary representation separating the human body from the background, finer segmentation masks can be obtained separating the different body parts. At a higher level of abstraction, we aim for a simpler yet descriptive representation of the human body. Human pose estimation methods usually rely on skeletal models of the human body, formed by segments (or rectangles) that represent the body limbs, appropriately connected following the kinematic constraints of the human body. In practice, such skeletal models must fulfill some constraints in order to allow for efficient inference, while actually limiting the expressiveness of the model. In order to cope with this, we introduce a top-down approach for predicting the position of the body parts in the model, using a mid-level part representation based on Poselets. Finally, we propose a framework for gesture recognition based on the bag of visual words framework. We leverage the benefits of RGB and depth image modalities by combining modality-specific visual vocabularies in a late fusion fashion. A new rotation-variant depth descriptor is presented, yielding better results than other state-of-the-art descriptors. Moreover, spatio-temporal pyramids are used to encode rough spatial and temporal structure. In addition, we present a probabilistic reformulation of Dynamic Time Warping for gesture segmentation in video sequences. A Gaussian-based probabilistic model of a gesture is learnt, implicitly encoding possible deformations in both spatial and time domains.
topic	Computer Vision; Pattern recognition; Statistical Pattern Recognition; Classification and Clustering; Separation and Segmentation; Face and Gesture
url	https://elcvia.cvc.uab.es/article/view/723
work_keys_str_mv	AT antoniohernandezvela frompixelstogestureslearningvisualrepresentationsforhumananalysisincoloranddepthdatasequences
_version_	1717376976848683008

Similar Items