Refined Spatial Network for Human Action Recognition
Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. The existing CNN-based methods apply low-resolution feature maps to get the high-level semantic labels. However, the slenderer spatial inf...
Main Authors: | Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2019-01-01 |
Series: | IEEE Access |
Subjects: | Action recognition, encoder-decoder, spatial features |
Online Access: | https://ieeexplore.ieee.org/document/8788614/ |
id |
doaj-20df63b860824666942b2e1b9304dac6 |
---|---|
record_format |
Article |
spelling |
doaj-20df63b860824666942b2e1b9304dac6; 2021-04-05T17:24:50Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2019-01-01; vol. 7, pp. 111043-111052; DOI 10.1109/ACCESS.2019.2933303; document 8788614. Refined Spatial Network for Human Action Recognition. Authors: Chunlei Wu (0), Haiwen Cao (1), Weishan Zhang (2, https://orcid.org/0000-0001-9800-1068), Leiquan Wang (3, https://orcid.org/0000-0003-4314-0030), Yiwei Wei (4), Zexin Peng (5). Affiliations: (0)-(3) and (5) College of Computer and Communication Engineering, China University of Petroleum, Qingdao, China; (4) School of Petroleum Engineering, China University of Petroleum, Beijing, China. Abstract: Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. Existing CNN-based methods apply low-resolution feature maps to obtain high-level semantic labels. However, the slenderer spatial information for action representation has been lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder is first used to obtain multi-level and multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, a refined spatial network (RSN) is proposed to aggregate the spatial network and the SSN. In particular, the learned representation of the RSN comprises two components, representing semantic label information and local slenderer spatial information. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN. URL: https://ieeexplore.ieee.org/document/8788614/. Keywords: Action recognition; encoder-decoder; spatial features |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng |
spellingShingle |
Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng; Refined Spatial Network for Human Action Recognition; IEEE Access; Action recognition, encoder-decoder, spatial features |
author_facet |
Chunlei Wu, Haiwen Cao, Weishan Zhang, Leiquan Wang, Yiwei Wei, Zexin Peng |
author_sort |
Chunlei Wu |
title |
Refined Spatial Network for Human Action Recognition |
title_sort |
refined spatial network for human action recognition |
publisher |
IEEE |
series |
IEEE Access |
issn |
2169-3536 |
publishDate |
2019-01-01 |
description |
Effective video representation is a key ingredient in action recognition, but how to learn effective spatial features remains a fundamental and challenging task. Existing CNN-based methods apply low-resolution feature maps to obtain high-level semantic labels. However, the slenderer spatial information for action representation has been lost. In this paper, we propose a novel stacked spatial network (SSN), which integrates multi-layer feature maps in an end-to-end manner. A spatial feature extraction network based on an encoder-decoder is first used to obtain multi-level and multi-resolution spatial features under the supervision of high-level semantic labels. The multi-level features are aggregated through a stacked spatial fusion layer, which intrinsically refines the traditional convolutional neural network. Then, a refined spatial network (RSN) is proposed to aggregate the spatial network and the SSN. In particular, the learned representation of the RSN comprises two components, representing semantic label information and local slenderer spatial information. Extensive experimental results on the UCF-101 and HMDB-51 datasets demonstrate the effectiveness of the proposed RSN. |
topic |
Action recognition, encoder-decoder, spatial features |
url |
https://ieeexplore.ieee.org/document/8788614/ |
work_keys_str_mv |
AT chunleiwu refinedspatialnetworkforhumanactionrecognition AT haiwencao refinedspatialnetworkforhumanactionrecognition AT weishanzhang refinedspatialnetworkforhumanactionrecognition AT leiquanwang refinedspatialnetworkforhumanactionrecognition AT yiweiwei refinedspatialnetworkforhumanactionrecognition AT zexinpeng refinedspatialnetworkforhumanactionrecognition |
_version_ |
1721539642915815424 |
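
Illustrative note on the abstract: the description field says that multi-level, multi-resolution feature maps are aggregated by a fusion layer and that the final representation combines semantic information with local spatial information. The record does not give the actual layer design, so the following is only a minimal sketch of that general idea under loud assumptions: feature maps are single-channel square grids, fusion is a plain element-wise average after nearest-neighbour upsampling, and the final descriptor is a simple concatenation. All function names (upsample_nearest, fuse_multilevel, global_avg_pool, refined_descriptor) are hypothetical and are not taken from the paper.

```python
from typing import List

Grid = List[List[float]]  # one single-channel, square feature map (rows x cols)


def upsample_nearest(fmap: Grid, size: int) -> Grid:
    """Nearest-neighbour resize of a square map to size x size (assumed scheme)."""
    n = len(fmap)
    return [
        [fmap[r * n // size][c * n // size] for c in range(size)]
        for r in range(size)
    ]


def fuse_multilevel(maps: List[Grid]) -> Grid:
    """Bring every map to the largest resolution, then average element-wise.

    Averaging is a stand-in for the fusion layer; the record does not
    specify how the paper's stacked spatial fusion layer combines maps.
    """
    size = max(len(m) for m in maps)
    resized = [upsample_nearest(m, size) for m in maps]
    return [
        [sum(m[r][c] for m in resized) / len(resized) for c in range(size)]
        for r in range(size)
    ]


def global_avg_pool(fmap: Grid) -> float:
    """Collapse a feature map to its mean activation."""
    cells = [v for row in fmap for v in row]
    return sum(cells) / len(cells)


def refined_descriptor(semantic: List[float], maps: List[Grid]) -> List[float]:
    """Concatenate a semantic vector with the pooled fused spatial feature."""
    return semantic + [global_avg_pool(fuse_multilevel(maps))]


if __name__ == "__main__":
    coarse = [[1.0]]                    # low-resolution, high-level map
    fine = [[0.0, 2.0], [4.0, 6.0]]     # higher-resolution, lower-level map
    print(refined_descriptor([0.25], [coarse, fine]))
```

In a real network the grids would be multi-channel tensors and the fusion weights would be learned; the sketch only shows how maps of different resolutions can be aligned and combined into one descriptor.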