Refined Spatial Network for Human Action Recognition

Bibliographic Details
Main Authors: Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng
Format: Article
Language:English
Published: IEEE 2019-01-01
Series:IEEE Access
Online Access:https://ieeexplore.ieee.org/document/8788614/
Description
Summary:Effective video representation is a key ingredient in action recognition, but learning effective spatial features remains a fundamental and challenging task. Existing CNN-based methods rely on low-resolution feature maps to predict high-level semantic labels, so the finer spatial information needed for action representation is lost. In this paper, we propose a novel stacked spatial network (SSN) that integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder architecture is first used to obtain multi-level, multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are then aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Finally, a refined spatial network (RSN) is proposed to combine the spatial network with the SSN. In particular, the learned representation of the RSN comprises two components, capturing semantic label information and fine local spatial information, respectively. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN.
ISSN:2169-3536
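
The abstract outlines three pieces: an encoder-decoder feature extractor supervised by semantic labels, a stacked spatial fusion layer that aggregates the resulting multi-level maps, and a refined spatial network that combines a plain spatial stream with the SSN. The PyTorch sketch below only illustrates that overall structure under stated assumptions; the layer sizes, the concatenate-then-1x1-convolution fusion rule, the late score fusion in RefinedSpatialNetwork, and all module names are hypothetical choices made for illustration and are not taken from the paper.

# Minimal, hypothetical sketch of the SSN/RSN structure described in the abstract.
# All architectural details here are assumptions; only the overall layout follows the text.
import torch
import torch.nn as nn


class EncoderDecoder(nn.Module):
    """Spatial feature extraction network yielding multi-level, multi-resolution maps."""
    def __init__(self, in_ch=3, base_ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base_ch, 3, 2, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base_ch, base_ch * 2, 3, 2, 1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base_ch * 2, base_ch, 4, 2, 1), nn.ReLU())

    def forward(self, x):
        f1 = self.enc1(x)   # lower-level, higher-resolution features
        f2 = self.enc2(f1)  # higher-level, lower-resolution features
        f3 = self.dec1(f2)  # decoded features restored to f1's resolution
        return [f1, f2, f3]


class StackedSpatialNetwork(nn.Module):
    """Aggregates the multi-level maps with a (assumed) concatenation + 1x1 conv fusion layer."""
    def __init__(self, base_ch=32, num_classes=101):
        super().__init__()
        self.backbone = EncoderDecoder(base_ch=base_ch)
        self.fuse = nn.Conv2d(base_ch + base_ch * 2 + base_ch, base_ch * 2, 1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(base_ch * 2, num_classes)

    def forward(self, x):
        feats = self.backbone(x)
        size = feats[0].shape[-2:]
        # Resample every level to a common resolution before fusing.
        feats = [nn.functional.interpolate(f, size=size, mode="bilinear",
                                           align_corners=False) for f in feats]
        fused = torch.relu(self.fuse(torch.cat(feats, dim=1)))
        return self.fc(self.pool(fused).flatten(1))


class RefinedSpatialNetwork(nn.Module):
    """Combines a plain spatial classification stream with the SSN, so the output
    reflects both semantic-label information and fine local spatial information."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.spatial = nn.Sequential(  # stand-in for an ordinary spatial CNN stream
            nn.Conv2d(3, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))
        self.ssn = StackedSpatialNetwork(num_classes=num_classes)

    def forward(self, x):
        # Late fusion of the two streams' class scores (an assumed fusion choice).
        return self.spatial(x) + self.ssn(x)


if __name__ == "__main__":
    frames = torch.randn(2, 3, 112, 112)      # a small batch of RGB frames
    logits = RefinedSpatialNetwork()(frames)
    print(logits.shape)                        # torch.Size([2, 101])

As a usage note, num_classes=101 matches UCF-101; the same sketch would use 51 output classes for HMDB-51.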