Summary: | Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. The existing CNN-based methods apply low-resolution feature maps to get the high-level semantic labels. However, the slenderer spatial information for action representation has lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. Spatial features extraction network based on encoder-decoder is firstly used to obtain multi-level and multi-resolution spatial features under the supervision of high-level sematic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, refined spatial network (RSN) is proposed to aggregate spatial network and SSN. Particularly, the learned representation of RSN comprises two components for representing semantic label information and local slenderer spatial information. Extensive experimental results on UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN.
|