Refined Spatial Network for Human Action Recognition

Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. The existing CNN-based methods apply low-resolution feature maps to get the high-level semantic labels. However, the slenderer spatial inf...

Bibliographic Details
Main Authors:	Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Action recognition; encoder-decoder; spatial features
Online Access:	https://ieeexplore.ieee.org/document/8788614/

id	doaj-20df63b860824666942b2e1b9304dac6
record_format	Article
spelling	doaj-20df63b860824666942b2e1b9304dac6; 2021-04-05T17:24:50Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; 7; 111043; 111052; 10.1109/ACCESS.2019.2933303; 8788614; Refined Spatial Network for Human Action Recognition; Chunlei Wu (0); Haiwen Cao (1); Weishan Zhang (2), https://orcid.org/0000-0001-9800-1068; Leiquan Wang (3), https://orcid.org/0000-0003-4314-0030; Yiwei Wei (4); Zexin Peng (5); College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China; College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China; College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China; College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China; School of Petroleum Engineering, China University of Petroleum, Beijing, China; College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China; Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. The existing CNN-based methods apply low-resolution feature maps to get the high-level semantic labels. However, the slenderer spatial information for action representation is lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder is first used to obtain multi-level and multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, a refined spatial network (RSN) is proposed to aggregate the spatial network and the SSN. In particular, the learned representation of RSN comprises two components for representing semantic label information and local slenderer spatial information. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN.; https://ieeexplore.ieee.org/document/8788614/; Action recognition; encoder-decoder; spatial features
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Chunlei Wu Haiwen Cao Weishan Zhang Leiquan Wang Yiwei Wei Zexin Peng
spellingShingle	Chunlei Wu Haiwen Cao Weishan Zhang Leiquan Wang Yiwei Wei Zexin Peng Refined Spatial Network for Human Action Recognition IEEE Access Action recognition encoder-decoder spatial features
author_facet	Chunlei Wu Haiwen Cao Weishan Zhang Leiquan Wang Yiwei Wei Zexin Peng
author_sort	Chunlei Wu
title	Refined Spatial Network for Human Action Recognition
title_short	Refined Spatial Network for Human Action Recognition
title_full	Refined Spatial Network for Human Action Recognition
title_fullStr	Refined Spatial Network for Human Action Recognition
title_full_unstemmed	Refined Spatial Network for Human Action Recognition
title_sort	refined spatial network for human action recognition
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. The existing CNN-based methods apply low-resolution feature maps to get the high-level semantic labels. However, the slenderer spatial information for action representation is lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder is first used to obtain multi-level and multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, a refined spatial network (RSN) is proposed to aggregate the spatial network and the SSN. In particular, the learned representation of RSN comprises two components for representing semantic label information and local slenderer spatial information. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN.
topic	Action recognition; encoder-decoder; spatial features
url	https://ieeexplore.ieee.org/document/8788614/
work_keys_str_mv	AT chunleiwu refinedspatialnetworkforhumanactionrecognition AT haiwencao refinedspatialnetworkforhumanactionrecognition AT weishanzhang refinedspatialnetworkforhumanactionrecognition AT leiquanwang refinedspatialnetworkforhumanactionrecognition AT yiweiwei refinedspatialnetworkforhumanactionrecognition AT zexinpeng refinedspatialnetworkforhumanactionrecognition
_version_	1721539642915815424

Similar Items