AVMSN: An Audio-Visual Two Stream Crowd Counting Framework Under Low-Quality Conditions
Crowd counting is an essential computer vision application that typically uses a convolutional neural network to model crowd density as a regression task. However, vision-only models struggle to extract reliable features under low-quality conditions. Because humans perceive changes in the physical world through both sight and sound, cross-modal information offers an alternative route to the crowd counting task. To address this, this paper establishes the Audio-Visual Multi-Scale Network (AVMSN), which models unconstrained visual and audio sources for crowd counting. The AVMSN consists of a feature-extraction module and a multi-modal fusion module. To handle objects of various sizes in a crowd scene, the feature-extraction module adopts Sample Convolutional Blocks as a multi-scale vision-end branch that computes a weighted visual feature. In parallel, the audio signal is transformed from the temporal domain into a spectrogram, and an audio feature is learned from it by an audio-VGG network. Finally, the multi-modal fusion module combines the weighted visual and audio features through a cascade fusion architecture to produce the estimated density map. Experimental results show that the proposed AVMSN achieves a lower mean absolute error than other state-of-the-art crowd counting models under low-quality conditions.
Main Authors: Ruihan Hu, Qinglong Mo, Yuanfei Xie, Yongqian Xu, Jiaqi Chen, Yalun Yang, Hongjian Zhou, Zhi-Ri Tang, Edmond Q. Wu
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access, vol. 9, pp. 80500-80510
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2021.3074797
Subjects: Multi-scale architecture; audio-visual model; cascade fusion; crowd counting
Online Access: https://ieeexplore.ieee.org/document/9416332/
Author Affiliations:
Ruihan Hu, Qinglong Mo, Yongqian Xu, Jiaqi Chen: Guangdong Key Laboratory of Modern Control Technology, Guangdong Institute of Intelligent Manufacturing, Guangzhou, China
Yuanfei Xie: Electronic Information School, Wuhan University, Wuhan, China
Yalun Yang, Edmond Q. Wu: Department of Automation, Shanghai Jiao Tong University, Shanghai, China
Hongjian Zhou: School of Mechanical and Electrical Engineering, Wuhan Institute of Technology, Wuhan, China
Zhi-Ri Tang: School of Physics and Technology, Wuhan University, Wuhan, China
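The abstract outlines a two-stream design: a multi-scale visual branch that produces a weighted visual feature, an audio branch that learns from a spectrogram with a VGG-style network, and a fusion stage that regresses a density map. The sketch below is a minimal, hypothetical illustration of how such a pipeline fits together in PyTorch; the layer sizes, the softmax-weighted parallel convolutions standing in for the Sample Convolutional Blocks, and the concatenation-based fusion standing in for the cascade fusion are all assumptions, since the record does not give the paper's exact design.

```python
# Illustrative two-stream audio-visual counting sketch (NOT the authors' exact AVMSN).
import torch
import torch.nn as nn


class MultiScaleVisionBranch(nn.Module):
    """Vision-end branch: parallel convolutions at several kernel sizes,
    combined by learned softmax weights into a single "weighted visual feature"."""

    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2), nn.ReLU())
             for k in (3, 5, 7)]
        )
        self.scale_weights = nn.Parameter(torch.ones(3))  # one weight per scale

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        w = torch.softmax(self.scale_weights, dim=0)
        return sum(wi * fi for wi, fi in zip(w, feats))


class AudioBranch(nn.Module):
    """Audio-end branch: a small VGG-style CNN over a (1, F, T) log spectrogram."""

    def __init__(self, out_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, spec):
        return self.fc(self.features(spec).flatten(1))  # (B, out_dim)


class AVCountingNet(nn.Module):
    """Broadcasts the audio embedding over the visual feature map, concatenates
    the two channel-wise, and regresses a single-channel density map."""

    def __init__(self):
        super().__init__()
        self.vision = MultiScaleVisionBranch()
        self.audio = AudioBranch()
        self.fusion_head = nn.Sequential(
            nn.Conv2d(64 + 64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, image, spectrogram):
        v = self.vision(image)                                   # (B, 64, H, W)
        a = self.audio(spectrogram)                               # (B, 64)
        a_map = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        density = self.fusion_head(torch.cat([v, a_map], dim=1))  # (B, 1, H, W)
        return density, density.sum(dim=(1, 2, 3))                # map and count


if __name__ == "__main__":
    model = AVCountingNet()
    img = torch.randn(2, 3, 128, 128)   # low-quality RGB frames
    spec = torch.randn(2, 1, 64, 96)    # spectrograms of the accompanying audio
    dmap, count = model(img, spec)
    print(dmap.shape, count.shape)      # torch.Size([2, 1, 128, 128]) torch.Size([2])
```

In practice, models of this kind are typically trained with a pixel-wise loss against ground-truth density maps and evaluated by the mean absolute error between predicted and annotated counts, which is the metric the abstract reports.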