Multi-View Visual Question Answering with Active Viewpoint Selection
This paper proposes a framework that allows the observation of a scene iteratively to answer a given question about the scene. Conventional visual question answering (VQA) methods are designed to answer given questions based on single-view images. However, in real-world applications, such as human–robot interaction (HRI), in which camera angles and occluded scenes must be considered, answering questions based on single-view images might be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to discuss the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that allows the observation of a scene actively until the necessary scene information to answer a given question is obtained. The proposed framework achieves comparable performance to a state-of-the-art method in question answering and simultaneously decreases the number of required observation viewpoints by a significant margin. Additionally, we found our framework plausibly learned to choose better viewpoints for answering questions, lowering the required number of camera movements. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework shows high accuracy (94.01%) for the unseen real image dataset.
| Main Authors: | Yue Qiu, Yutaka Satoh, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka |
|---|---|
| Author Affiliations: | Yue Qiu, Yutaka Satoh: Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan; Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka: National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan |
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2020-04-01 |
| Series: | Sensors, Vol. 20, Article 2281 |
| ISSN: | 1424-8220 |
| DOI: | 10.3390/s20082281 |
| Subjects: | visual question answering; three-dimensional (3D) vision; reinforcement learning; deep learning; human–robot interaction |
| Online Access: | https://www.mdpi.com/1424-8220/20/8/2281 |
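The abstract describes an iterative observe-and-answer loop: the system keeps selecting new viewpoints until it has gathered enough scene information to answer the question, and the subject keywords indicate that this selection is learned with reinforcement learning. The snippet below is a minimal illustrative sketch of such a loop, not the authors' implementation; every name in it (encode_question, encode_view, select_next_viewpoint, answer_or_continue, NUM_VIEWPOINTS, MAX_OBSERVATIONS) is a hypothetical placeholder standing in for the learned modules the paper describes.

```python
# Minimal illustrative sketch of an active viewpoint selection loop for
# multi-view VQA. This is NOT the paper's implementation: all names below
# are hypothetical placeholders for the learned modules it describes.
from __future__ import annotations

import random

NUM_VIEWPOINTS = 8      # assumed number of candidate camera poses around the scene
MAX_OBSERVATIONS = 4    # assumed budget on camera movements


def encode_question(question: str) -> list[float]:
    """Placeholder for a learned question encoder (e.g., an RNN over tokens)."""
    return [float(len(question.split()))]


def encode_view(viewpoint: int) -> list[float]:
    """Placeholder for an image encoder applied to the view captured from `viewpoint`."""
    return [float(viewpoint)]


def select_next_viewpoint(state: dict, visited: set[int]) -> int:
    """Placeholder viewpoint-selection policy.

    The paper's keywords suggest this decision is trained with reinforcement
    learning; here it simply picks a random unvisited viewpoint.
    """
    return random.choice([v for v in range(NUM_VIEWPOINTS) if v not in visited])


def answer_or_continue(state: dict) -> str | None:
    """Placeholder answering module: return an answer once the accumulated
    views are judged sufficient, otherwise None to request another view."""
    return "yes" if len(state["views"]) >= 2 else None


def active_vqa(question: str) -> str:
    """Observe the scene iteratively until the question can be answered."""
    state: dict = {"question": encode_question(question), "views": []}
    visited: set[int] = set()
    answer: str | None = None
    for _ in range(MAX_OBSERVATIONS):
        viewpoint = select_next_viewpoint(state, visited)   # choose where to look next
        visited.add(viewpoint)
        state["views"].append(encode_view(viewpoint))        # accumulate multi-view evidence
        answer = answer_or_continue(state)
        if answer is not None:                               # stop early: enough information
            break
    return answer if answer is not None else "unknown"


if __name__ == "__main__":
    print(active_vqa("Is there a red cube behind the metal sphere?"))
```

In the paper's framework, stopping early in this way is what reduces the number of required observation viewpoints and camera movements while keeping answering accuracy comparable to a state-of-the-art multi-view baseline.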