Refined Spatial Network for Human Action Recognition

Effective video representation is a key ingredient in action recognition, but learning effective spatial features remains a fundamental and challenging task. Existing CNN-based methods apply low-resolution feature maps to obtain high-level semantic labels. However, the slenderer spatial inf...

Bibliographic Details
Main Authors: Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Action recognition; encoder-decoder; spatial features
Online Access: https://ieeexplore.ieee.org/document/8788614/
id doaj-20df63b860824666942b2e1b9304dac6
record_format Article
spelling doaj-20df63b860824666942b2e1b9304dac6 | 2021-04-05T17:24:50Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2019-01-01 | vol. 7 | pp. 111043-111052 | DOI 10.1109/ACCESS.2019.2933303 | 8788614
Refined Spatial Network for Human Action Recognition
Chunlei Wu (0), College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China
Haiwen Cao (1), College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China
Weishan Zhang (2), https://orcid.org/0000-0001-9800-1068, College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China
Leiquan Wang (3), https://orcid.org/0000-0003-4314-0030, College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China
Yiwei Wei (4), School of Petroleum Engineering, China University of Petroleum, Beijing, China
Zexin Peng (5), College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China
https://ieeexplore.ieee.org/document/8788614/
Action recognition; encoder-decoder; spatial features
collection DOAJ
language English
format Article
sources DOAJ
author Chunlei Wu
Haiwen Cao
Weishan Zhang
Leiquan Wang
Yiwei Wei
Zexin Peng
title Refined Spatial Network for Human Action Recognition
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Effective video representation is a key ingredient in action recognition, but learning effective spatial features remains a fundamental and challenging task. Existing CNN-based methods apply low-resolution feature maps to obtain high-level semantic labels. However, the slenderer spatial information for action representation is lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder is first used to obtain multi-level and multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, a refined spatial network (RSN) is proposed to aggregate the spatial network and the SSN. In particular, the learned representation of the RSN comprises two components, representing semantic label information and local slenderer spatial information. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN.
topic Action recognition
encoder-decoder
spatial features
url https://ieeexplore.ieee.org/document/8788614/
_version_ 1721539642915815424
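
The description field of this record mentions aggregating multi-level, multi-resolution feature maps and combining a semantic component with a spatial component. The record gives no implementation details, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: nearest-neighbour upsampling, channel-wise concatenation, and global average pooling are stand-ins chosen for illustration, and the function names, shapes, and random inputs are all hypothetical. It is not the authors' SSN or RSN.

```python
import numpy as np


def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map to (C, out_h, out_w)."""
    _, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return fmap[:, rows[:, None], cols[None, :]]


def fuse_multilevel(fmaps):
    """Resize every map to the finest resolution and stack along channels.

    Assumption: plain concatenation stands in for a learned fusion layer.
    """
    out_h = max(f.shape[1] for f in fmaps)
    out_w = max(f.shape[2] for f in fmaps)
    resized = [upsample_nearest(f, out_h, out_w) for f in fmaps]
    return np.concatenate(resized, axis=0)


def combined_representation(semantic_vec, fmaps):
    """Join a semantic vector with a pooled spatial vector (illustrative only)."""
    fused = fuse_multilevel(fmaps)
    spatial_vec = fused.mean(axis=(1, 2))  # global average pooling per channel
    return np.concatenate([semantic_vec, spatial_vec])


# Hypothetical inputs: three feature maps at decreasing resolution and a
# 64-dimensional vector standing in for high-level semantic features.
rng = np.random.default_rng(0)
fmaps = [
    rng.standard_normal((8, 28, 28)),
    rng.standard_normal((16, 14, 14)),
    rng.standard_normal((32, 7, 7)),
]
semantic_vec = rng.standard_normal(64)

rep = combined_representation(semantic_vec, fmaps)
print(rep.shape)  # (120,) = 64 semantic + 8 + 16 + 32 spatial channels
```

In a trained network the fusion step would normally be learned (for example with convolutions over the stacked maps) rather than fixed as it is here; the sketch only shows how maps of different resolutions can be brought to a common grid before being combined.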