AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions

Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. Vision and audio are both widely used...

Bibliographic Details
Main Authors: Ruihan Hu, Qinglong Mo, Yuanfei Xie, Yongqian Xu, Jiaqi Chen, Yalun Yang, Hongjian Zhou, Zhi-Ri Tang, Edmond Q. Wu
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Multi-scale architecture; audio-visual model; cascade fusion; crowd counting
Online Access: https://ieeexplore.ieee.org/document/9416332/
id doaj-3ffb5e5cfb494ba3b2324d68e2d72155
record_format Article
spelling doaj-3ffb5e5cfb494ba3b2324d68e2d72155 2021-06-07T23:00:53Z eng
IEEE, IEEE Access, ISSN 2169-3536, 2021-01-01, vol. 9, pp. 80500-80510, DOI 10.1109/ACCESS.2021.3074797, document 9416332
Ruihan Hu (0), https://orcid.org/0000-0001-8525-2503, Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
Qinglong Mo (1), https://orcid.org/0000-0001-8525-2503, Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
Yuanfei Xie (2), Electronic Information School, Wuhan University, Wuhan, China
Yongqian Xu (3), Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
Jiaqi Chen (4), Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
Yalun Yang (5), Department of Automation, Shanghai Jiao Tong University, Shanghai, China
Hongjian Zhou (6), School of Mechanical and Electrical Engineering, Wuhan Institute of Technology, Wuhan, China
Zhi-Ri Tang (7), School of Physics and Technology, Wuhan University, Wuhan, China
Edmond Q. Wu (8), https://orcid.org/0000-0002-4900-0787, Department of Automation, Shanghai Jiao Tong University, Shanghai, China
collection DOAJ
language English
format Article
sources DOAJ
author Ruihan Hu
Qinglong Mo
Yuanfei Xie
Yongqian Xu
Jiaqi Chen
Yalun Yang
Hongjian Zhou
Zhi-Ri Tang
Edmond Q. Wu
author_sort Ruihan Hu
title AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
description Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. Vision and audio are both widely used media through which humans perceive physical changes in the world, and cross-modal information offers an alternative way to solve the crowd counting task. To address this problem, this paper proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources for crowd counting. The AVMSN consists of a Feature extraction module and a Multi-modal fusion module. To handle objects of various sizes in the crowd scene, the Feature extraction module adopts Sample Convolutional Blocks as a multi-scale Vision-end branch that calculates the weighted-visual feature. The audio signal is transformed from the temporal domain into a spectrogram, and the audio feature is learned by the audio-VGG network. Finally, the weighted-visual and audio features are fused by the Multi-modal fusion module, which adopts a cascade fusion architecture to calculate the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
topic Multi-scale architecture
audio-visual model
cascade fusion
crowd counting
url https://ieeexplore.ieee.org/document/9416332/
_version_ 1721391118878244864
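
The description above says the network outputs an estimated density map and is evaluated by mean absolute error. The following is a minimal sketch of how those two quantities are commonly related in density-map crowd counting. It assumes the usual convention that the predicted count is the sum of the density map; the abstract does not confirm that detail. The function names `count_from_density_map` and `mean_absolute_error` and the toy values are hypothetical, not taken from the paper.

```python
from typing import Sequence


def count_from_density_map(density_map: Sequence[Sequence[float]]) -> float:
    """Return the estimated crowd count as the sum of all density values.

    Assumption: the density map integrates to the number of people,
    which is the common convention but is not stated in the abstract.
    """
    return sum(sum(row) for row in density_map)


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Two toy 2x2 density maps (hypothetical values, not real model output).
    density_maps = [
        [[0.5, 0.5], [1.0, 0.0]],      # sums to 2.0
        [[0.25, 0.25], [0.25, 0.25]],  # sums to 1.0
    ]
    true_counts = [3.0, 1.0]

    predicted_counts = [count_from_density_map(m) for m in density_maps]
    print(mean_absolute_error(predicted_counts, true_counts))  # prints 0.5
```

A lower value means the predicted counts are closer to the ground truth on average, which is the sense in which the abstract compares AVMSN with other models.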