AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions

Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. As we know, visual and audio are used...


Bibliographic Details
Main Authors:	Ruihan Hu, Qinglong Mo, Yuanfei Xie, Yongqian Xu, Jiaqi Chen, Yalun Yang, Hongjian Zhou, Zhi-Ri Tang, Edmond Q. Wu
Format:	Article
Language:	English
Published:	IEEE 2021-01-01
Series:	IEEE Access
Subjects:	Multi-scale architecture; audio-visual model; cascade fusion; crowd counting
Online Access:	https://ieeexplore.ieee.org/document/9416332/

id	doaj-3ffb5e5cfb494ba3b2324d68e2d72155
record_format	Article
spelling	doaj-3ffb5e5cfb494ba3b2324d68e2d72155; 2021-06-07T23:00:53Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2021-01-01; vol. 9, pp. 80500-80510; DOI 10.1109/ACCESS.2021.3074797; document 9416332
	Title: AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
	Authors:
	[0] Ruihan Hu (https://orcid.org/0000-0001-8525-2503)
	[1] Qinglong Mo (https://orcid.org/0000-0001-8525-2503)
	[2] Yuanfei Xie
	[3] Yongqian Xu
	[4] Jiaqi Chen
	[5] Yalun Yang
	[6] Hongjian Zhou
	[7] Zhi-Ri Tang
	[8] Edmond Q. Wu (https://orcid.org/0000-0002-4900-0787)
	Affiliations:
	[0] Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
	[1] Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
	[2] Electronic Information School, Wuhan University, Wuhan, China
	[3] Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
	[4] Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
	[5] Department of Automation, Shanghai Jiao Tong University, Shanghai, China
	[6] School of Mechanical and Electrical Engineering, Wuhan Institute of Technology, Wuhan, China
	[7] School of Physics and Technology, Wuhan University, Wuhan, China
	[8] Department of Automation, Shanghai Jiao Tong University, Shanghai, China
	URL: https://ieeexplore.ieee.org/document/9416332/
	Keywords: Multi-scale architecture; audio-visual model; cascade fusion; crowd counting
collection	DOAJ
language	English
format	Article
sources	DOAJ
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2021-01-01
description	Crowd counting is an essential computer vision application that uses convolutional neural networks to model crowd density as a regression task. However, vision-based models struggle to extract features under low-quality conditions. Vision and audio are both widely used media through which humans perceive physical changes in the world, and this cross-modal information offers an alternative approach to the crowd counting task. To this end, this paper proposes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources to perform crowd counting. AVMSN is built on a Feature extraction module and a Multi-modal fusion module. To handle objects of various sizes in the crowd scene, the Feature extraction module adopts Sample Convolutional Blocks as its multi-scale Vision-end branch, which computes the weighted-visual feature. The audio signal is transformed from the temporal domain into a spectrogram, and the audio feature is learned by the audio-VGG network. Finally, the weighted-visual and audio features are fused by the Multi-modal fusion module, which adopts a cascade fusion architecture to compute the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
topic	Multi-scale architecture; audio-visual model; cascade fusion; crowd counting
url	https://ieeexplore.ieee.org/document/9416332/
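The abstract mentions three computations that can be illustrated without the networks themselves: turning the audio waveform into a spectrogram, reading a crowd count off an estimated density map, and scoring predictions by mean absolute error. The following is a minimal NumPy sketch of those steps, not the authors' implementation. The function names, the frame length, the hop size, and the Hann window are all assumptions made for illustration; the record does not give the paper's preprocessing settings. Summing the density map to obtain the count is the usual convention in density-based crowd counting and is assumed here.

```python
import numpy as np


def log_spectrogram(waveform, frame_len=256, hop=128):
    """Log-magnitude spectrogram of a 1-D waveform.

    frame_len and hop are illustrative defaults, not values from the paper.
    Returns an array of shape (frame_len // 2 + 1, number_of_frames).
    """
    waveform = np.asarray(waveform, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the signal into overlapping windowed frames.
    frames = np.stack([
        waveform[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real FFT per frame gives the magnitude at each frequency bin.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T


def count_from_density(density_map):
    """Estimated crowd count: the sum of all values in the density map."""
    return float(np.sum(density_map))


def mean_absolute_error(predicted_counts, true_counts):
    """Mean absolute difference between predicted and true counts."""
    predicted = np.asarray(predicted_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    return float(np.mean(np.abs(predicted - true)))


if __name__ == "__main__":
    # A synthetic 1 kHz tone sampled at 8 kHz stands in for real audio.
    t = np.arange(1024) / 8000.0
    spec = log_spectrogram(np.sin(2 * np.pi * 1000.0 * t))
    print("spectrogram shape:", spec.shape)

    # A toy density map with mass spread over three pixels.
    density = np.zeros((4, 4))
    density[0, 0] = 0.5
    density[1, 2] = 0.5
    density[3, 3] = 1.0
    print("estimated count:", count_from_density(density))

    print("MAE:", mean_absolute_error([10, 20], [12, 17]))
```

In a full pipeline the spectrogram would be the input to the audio network and the density map would be the output of the fusion module; here both are replaced by synthetic arrays so the sketch runs on its own.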


Similar Items