Learning space-time structures for action recognition and localization

This thesis studies automatic human action recognition and localization in videos: given a video, the goal is to recognize the category of the human action taking place and to localize that action in space and/or time. The problem is challenging because of the complexity of human actions, large intra-class variations, and distracting backgrounds. Human actions are inherently structured patterns of body movements, yet past work has been inadequate at learning the space-time structures of human actions and exploiting them for better recognition and localization. This thesis proposes new methods that exploit such space-time structures for effective human action recognition and localization in videos, including sports videos, YouTube videos, TV programs, and movies.

A new local space-time video representation, hierarchical Space-Time Segments, is proposed first. Using this representation, ensembles of hierarchical spatio-temporal trees, discovered directly from the training videos, are constructed to model the hierarchical, spatial, and temporal structures of human actions. The approach achieves promising performance in action recognition and localization on challenging benchmark datasets. Moreover, the discovered trees generalize well across datasets: trees learned on one dataset can be used to recognize and localize similar actions in another.
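The record does not detail the underlying data structure, but a hierarchical space-time segment can be pictured as a spatial region tracked over a frame range whose children are finer-grained sub-regions (e.g., limbs under a whole body). A minimal illustrative sketch in Python, with all names hypothetical:

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SpaceTimeSegment:
        # A hypothetical node in a hierarchical space-time segment tree:
        # a spatial region tracked over a contiguous frame range, with
        # children holding finer-grained sub-regions of the same span.
        t_start: int                             # first frame of the segment
        t_end: int                               # last frame (inclusive)
        boxes: List[Tuple[int, int, int, int]]   # one (x, y, w, h) box per frame
        children: List["SpaceTimeSegment"] = field(default_factory=list)

        def duration(self) -> int:
            return self.t_end - self.t_start + 1

        def subtree(self):
            # Yield this segment and all of its descendants, root first.
            yield self
            for child in self.children:
                yield from child.subtree()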
To handle large-scale data, a deep model is explored that learns the temporal progression of actions using Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN). Two novel ranking losses are proposed to train the model to better capture the temporal structure of actions for accurate recognition and temporal localization; this model achieves state-of-the-art performance on a large-scale video dataset.
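The two ranking losses themselves are defined in the thesis, not in this record. One plausible form of such a loss, which penalizes the model whenever the ground-truth class score drops as the action progresses, can be sketched in PyTorch (the tensor shapes and the margin are illustrative assumptions):

    import torch
    import torch.nn as nn

    class TemporalRankingLoss(nn.Module):
        # Illustrative ranking loss on per-frame action scores: encourage the
        # ground-truth class score to be non-decreasing while the action is
        # in progress (s_t >= s_{t-1} - margin). This is an assumption about
        # the flavor of loss, not the thesis's exact formulation.
        def __init__(self, margin: float = 0.0):
            super().__init__()
            self.margin = margin

        def forward(self, scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            # scores: (batch, time, num_classes) per-frame scores from the LSTM
            # labels: (batch,) ground-truth class index of the ongoing action
            idx = labels.view(-1, 1, 1).expand(-1, scores.size(1), 1)
            s = scores.gather(2, idx).squeeze(2)       # (batch, time) true-class scores
            drop = s[:, :-1] - s[:, 1:] - self.margin  # > 0 where the score falls
            return torch.clamp(drop, min=0).mean()

In training, a loss of this kind would typically be added to a standard per-frame cross-entropy classification loss rather than used alone.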
Such deep models usually employ a Convolutional Neural Network (CNN) to learn visual features from video frames, and training a CNN typically requires a large number of training videos. The thesis therefore also studies the use of web action images for CNN training: the findings show that web action images can serve as additional training data, significantly reducing the burden of collecting training videos.
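The record gives no training details; the basic idea of pooling web action images with sampled video frames when training a frame-level CNN can be sketched as follows (the dataset classes and paths are hypothetical placeholders):

    from torch.utils.data import ConcatDataset, DataLoader

    def make_mixed_loader(video_frame_ds, web_image_ds, batch_size=64):
        # Pool video frames and web action images into one training set.
        # Both datasets are assumed to yield (image_tensor, action_label)
        # pairs, so web images simply enlarge the frame-level training data
        # and reduce how many labeled videos must be collected.
        mixed = ConcatDataset([video_frame_ds, web_image_ds])
        return DataLoader(mixed, batch_size=batch_size, shuffle=True, num_workers=4)

    # Hypothetical usage, assuming two map-style datasets of labeled images:
    # loader = make_mixed_loader(VideoFrameDataset("frames/"), WebActionImageDataset("web/"))
    # for images, labels in loader:
    #     ...  # train the CNN on the mixed batch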

Bibliographic Details
Main Author: Ma, Shugao
Format: Thesis/Dissertation
Language: English (en_US)
Published: 2016
Subjects: Computer science; Action localization; Action recognition; Computer vision; Deep learning; Machine learning; Space-time structures
License: Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Online Access: https://hdl.handle.net/2144/17720