Multi-View Visual Question Answering with Active Viewpoint Selection

Bibliographic Details
Main Authors: Yue Qiu, Yutaka Satoh, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka
Format: Article
Language: English
Published: MDPI AG 2020-04-01
Series: Sensors
Subjects: visual question answering; three-dimensional (3D) vision; reinforcement learning; deep learning; human–robot interaction
Online Access: https://www.mdpi.com/1424-8220/20/8/2281
id doaj-d85a4ca10a724f53b6a8aa1df363e7fd
record_format Article
spelling doaj-d85a4ca10a724f53b6a8aa1df363e7fd | 2020-11-25T02:37:36Z | eng | MDPI AG | Sensors | 1424-8220 | 2020-04-01 | Vol. 20, Art. 2281 | 10.3390/s20082281 | Multi-View Visual Question Answering with Active Viewpoint Selection | Yue Qiu (Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan); Yutaka Satoh (Graduate School of Science and Technology, University of Tsukuba, Tsukuba 305-8577, Japan); Ryota Suzuki (National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan); Kenji Iwata (National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan); Hirokatsu Kataoka (National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba 305-8560, Japan) | This paper proposes a framework that iteratively observes a scene in order to answer a given question about it. Conventional visual question answering (VQA) methods are designed to answer given questions based on single-view images. However, in real-world applications such as human–robot interaction (HRI), in which camera angles and occluded scenes must be considered, answering questions from a single-view image might be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to discuss the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that actively observes a scene until the scene information necessary to answer a given question has been obtained. The proposed framework achieves question-answering performance comparable to a state-of-the-art method while reducing the number of required observation viewpoints by a significant margin. Additionally, we found that our framework plausibly learned to choose better viewpoints for answering questions, lowering the required number of camera movements. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework achieves high accuracy (94.01%) on this unseen real-image dataset. | https://www.mdpi.com/1424-8220/20/8/2281 | visual question answering; three-dimensional (3D) vision; reinforcement learning; deep learning; human–robot interaction
collection DOAJ
language English
format Article
sources DOAJ
author Yue Qiu
Yutaka Satoh
Ryota Suzuki
Kenji Iwata
Hirokatsu Kataoka
spellingShingle Yue Qiu
Yutaka Satoh
Ryota Suzuki
Kenji Iwata
Hirokatsu Kataoka
Multi-View Visual Question Answering with Active Viewpoint Selection
Sensors
visual question answering
three-dimensional (3D) vision
reinforcement learning
deep learning
human–robot interaction
author_facet Yue Qiu
Yutaka Satoh
Ryota Suzuki
Kenji Iwata
Hirokatsu Kataoka
author_sort Yue Qiu
title Multi-View Visual Question Answering with Active Viewpoint Selection
title_short Multi-View Visual Question Answering with Active Viewpoint Selection
title_full Multi-View Visual Question Answering with Active Viewpoint Selection
title_fullStr Multi-View Visual Question Answering with Active Viewpoint Selection
title_full_unstemmed Multi-View Visual Question Answering with Active Viewpoint Selection
title_sort multi-view visual question answering with active viewpoint selection
publisher MDPI AG
series Sensors
issn 1424-8220
publishDate 2020-04-01
description This paper proposes a framework that iteratively observes a scene in order to answer a given question about it. Conventional visual question answering (VQA) methods are designed to answer given questions based on single-view images. However, in real-world applications such as human–robot interaction (HRI), in which camera angles and occluded scenes must be considered, answering questions from a single-view image might be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to discuss the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that actively observes a scene until the scene information necessary to answer a given question has been obtained. The proposed framework achieves question-answering performance comparable to a state-of-the-art method while reducing the number of required observation viewpoints by a significant margin. Additionally, we found that our framework plausibly learned to choose better viewpoints for answering questions, lowering the required number of camera movements. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework achieves high accuracy (94.01%) on this unseen real-image dataset.
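
Editor's note: the description above outlines only the control flow of the framework, in which a learned policy (trained with reinforcement learning, per the abstract) repeatedly selects the next camera viewpoint until it judges that enough of the scene has been observed, after which an answering module produces the answer. The following minimal Python sketch illustrates that loop under assumed placeholder components; Scene, ViewpointPolicy, VQAModel, and the toy three-view stopping rule are hypothetical and are not the authors' published implementation.

    # Minimal sketch of the iterative observe-then-answer loop described above.
    # Scene, ViewpointPolicy, and VQAModel are hypothetical placeholders, not the
    # authors' implementation; the stopping rule is a toy stand-in for the
    # learned (reinforcement-learning) viewpoint-selection policy.
    from dataclasses import dataclass
    from typing import List, Optional
    import random

    @dataclass
    class Scene:
        """Stand-in for a scene that can be rendered from a discrete set of viewpoints."""
        num_viewpoints: int = 8

        def render(self, viewpoint: int) -> str:
            # A real system would return an image captured from the given camera pose.
            return f"image_from_viewpoint_{viewpoint}"

    class ViewpointPolicy:
        """Placeholder for the learned viewpoint selector."""
        def next_viewpoint(self, question: str, observations: List[str], scene: Scene) -> Optional[int]:
            # Return None to stop observing and answer; otherwise return the next viewpoint index.
            if len(observations) >= 3:  # toy stopping rule (assumption, not the paper's criterion)
                return None
            return random.randrange(scene.num_viewpoints)

    class VQAModel:
        """Placeholder answering module that fuses the collected observations."""
        def answer(self, question: str, observations: List[str]) -> str:
            return f"answer based on {len(observations)} view(s)"

    def active_vqa(question: str, scene: Scene, policy: ViewpointPolicy, vqa: VQAModel) -> str:
        """Observe the scene iteratively until the policy decides enough views were gathered."""
        observations: List[str] = []
        viewpoint = 0  # arbitrary initial camera pose
        while True:
            observations.append(scene.render(viewpoint))
            nxt = policy.next_viewpoint(question, observations, scene)
            if nxt is None:  # the policy judges the collected views sufficient
                return vqa.answer(question, observations)
            viewpoint = nxt  # move the camera and observe again

    if __name__ == "__main__":
        print(active_vqa("What colour is the occluded object?", Scene(), ViewpointPolicy(), VQAModel()))
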
topic visual question answering
three-dimensional (3D) vision
reinforcement learning
deep learning
human–robot interaction
url https://www.mdpi.com/1424-8220/20/8/2281
work_keys_str_mv AT yueqiu multiviewvisualquestionansweringwithactiveviewpointselection
AT yutakasatoh multiviewvisualquestionansweringwithactiveviewpointselection
AT ryotasuzuki multiviewvisualquestionansweringwithactiveviewpointselection
AT kenjiiwata multiviewvisualquestionansweringwithactiveviewpointselection
AT hirokatsukataoka multiviewvisualquestionansweringwithactiveviewpointselection
_version_ 1724794548357431296