Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention

Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition, which mainly uses a two-dimensional (2D) convolutional neural network (CNN), one of the difficulties...


Bibliographic Details
Main Authors: Fazliddin Anvarov, Dae Ha Kim, Byung Cheol Song
Format: Article
Language: English
Published: MDPI AG 2020-01-01
Series: Electronics
Subjects: action recognition; 3d cnn; deep feature attention
Online Access: https://www.mdpi.com/2079-9292/9/1/147
id doaj-b4c1f505fdc244349542ecb37067f413
record_format Article
spelling doaj-b4c1f505fdc244349542ecb37067f413
2020-11-25T01:32:46Z
eng
MDPI AG
Electronics, ISSN 2079-9292, 2020-01-01, vol. 9, no. 1, art. 147
doi: 10.3390/electronics9010147
electronics9010147
Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
Fazliddin Anvarov, Department of Electronic Engineering, Inha University, Incheon 22212, Korea
Dae Ha Kim, Department of Electronic Engineering, Inha University, Incheon 22212, Korea
Byung Cheol Song, Department of Electronic Engineering, Inha University, Incheon 22212, Korea
https://www.mdpi.com/2079-9292/9/1/147
action recognition; 3d cnn; deep feature attention
collection DOAJ
language English
format Article
sources DOAJ
author Fazliddin Anvarov
Dae Ha Kim
Byung Cheol Song
spellingShingle Fazliddin Anvarov
Dae Ha Kim
Byung Cheol Song
Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
Electronics
action recognition
3d cnn
deep feature attention
author_facet Fazliddin Anvarov
Dae Ha Kim
Byung Cheol Song
author_sort Fazliddin Anvarov
title Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
title_short Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
title_full Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
title_fullStr Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
title_full_unstemmed Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
title_sort action recognition using deep 3d cnns with sequential feature aggregation and attention
publisher MDPI AG
series Electronics
issn 2079-9292
publishDate 2020-01-01
description Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition, which mainly uses two-dimensional (2D) convolutional neural networks (CNNs), video-based action recognition faces the difficulty that video action behavior must characterize both short-term small movements and long-term temporal appearance information. Previous methods analyze video action behavior using only a basic 3D CNN framework. However, these approaches are limited in analyzing fast action movements or abruptly appearing objects because of the limited coverage of convolutional filters. In this paper, we propose aggregating squeeze-and-excitation (SE) and self-attention (SA) modules with a 3D CNN to analyze both short- and long-term temporal action behavior efficiently. We implemented the SE and SA modules to present a novel approach to video action recognition that builds upon current state-of-the-art methods and demonstrates better performance on the UCF-101 and HMDB51 datasets. For example, we achieve accuracies of 92.5% (16f-clip) and 95.6% (64f-clip) on UCF-101, and 68.1% (16f-clip) and 74.1% (64f-clip) on HMDB51 with the ResNeXt-101 architecture in a 3D CNN.
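The abstract describes attaching squeeze-and-excitation (channel recalibration) and self-attention modules to a 3D CNN. Since only the abstract is reproduced here, the following NumPy sketch is an illustrative reconstruction of those two building blocks, not the authors' implementation; the function names, weight shapes, and reduction choices are all assumptions for the sake of the example.

```python
import numpy as np

def squeeze_excite_3d(x, w1, w2):
    """SE-style channel attention over a 5D feature map x: (N, C, T, H, W).

    w1: (C, C//r) and w2: (C//r, C) are the two excitation FC weights.
    """
    s = x.mean(axis=(2, 3, 4))              # squeeze: global pool over T, H, W -> (N, C)
    z = np.maximum(s @ w1, 0.0)             # excitation FC1 + ReLU
    g = 1.0 / (1.0 + np.exp(-(z @ w2)))     # excitation FC2 + sigmoid gate in (0, 1)
    return x * g[:, :, None, None, None]    # channel-wise rescaling of the feature map

def temporal_self_attention(f, wq, wk, wv):
    """Scaled dot-product self-attention over time. f: (T, D) per-clip features."""
    q, k, v = f @ wq, f @ wk, f @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])                  # (T, T) frame-to-frame affinities
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v                                             # time-aggregated features, (T, D)
```

Intuitively, the SE gate handles short-term channel-level emphasis cheaply, while the attention matrix lets every time step draw on every other one, covering long-range temporal dependencies that a fixed-size 3D convolution kernel cannot reach.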
topic action recognition
3d cnn
deep feature attention
url https://www.mdpi.com/2079-9292/9/1/147
work_keys_str_mv AT fazliddinanvarov actionrecognitionusingdeep3dcnnswithsequentialfeatureaggregationandattention
AT daehakim actionrecognitionusingdeep3dcnnswithsequentialfeatureaggregationandattention
AT byungcheolsong actionrecognitionusingdeep3dcnnswithsequentialfeatureaggregationandattention
_version_ 1725079877859672064