Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
One of the long-standing problems in artificial intelligence is the development of intelligent agents with complete visual understanding. Understanding entails recognition of scene attributes such as actors, objects, and actions, as well as reasoning about the common semantic structure that combines these attributes into a coherent description.
Main Author: Aakur, Sathyanarayanan Narasimhan
Format: Others
Published: Scholar Commons, 2019
Subjects: Common sense knowledge; Visual understanding; ConceptNet; Computer Sciences
Online Access: https://scholarcommons.usf.edu/etd/7718 https://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=8915&context=etd
id: ndltd-USF-oai-scholarcommons.usf.edu-etd-8915
record_format: oai_dc
collection: NDLTD
format: Others
sources: NDLTD
topic: Common sense knowledge; Visual understanding; ConceptNet; Computer Sciences
description: One of the long-standing problems in artificial intelligence is the development of intelligent agents with complete visual understanding. Understanding entails recognition of scene attributes such as actors, objects, and actions, as well as reasoning about the common semantic structure that combines these attributes into a coherent description. While significant milestones have been achieved in the field of computer vision, the majority of the work has concentrated on supervised visual recognition, where complex visual representations are learned and a few discrete categories or labels are assigned to these representations. This implies a closed world, where the underlying assumption is that all environments contain the same objects and events, which are in one-to-one correspondence with the grounded evidence in the image. Hence, the learned knowledge is limited to the annotated training set. An open world, on the other hand, makes no assumptions about the distribution of semantics and requires generalization beyond the training annotations. Increasingly complex models require massive amounts of training data and offer little to no explainability due to the lack of transparency in their decision-making process. The ability of artificial intelligence systems to offer explanations for their decisions is central to building user confidence and structuring smart human-machine interactions. In this dissertation, we develop an inherently explainable approach for generating rich interpretations of visual scenes. We move towards open-world, open-domain visual understanding by decoupling the ideas of recognition and reasoning. We integrate common sense knowledge from large knowledge bases such as ConceptNet with the representation learning capabilities of deep learning approaches in a pattern theory formalism to interpret complex visual scenes. Specifically, we first define and develop the idea of contextualization to model and establish complex semantic relationships among concepts grounded in visual data. The resulting semantic structures, called interpretations, represent the visual scene in an intermediate form that can then serve as the source of knowledge for various modes of expression such as labels, captions, and even question answering. Second, we explore the inherent explainability of such visual interpretations and define key components for extending the notion of explainability to intelligent agents for visual recognition. Finally, we describe a self-supervised model for segmenting untrimmed videos into their constituent events. We show that this approach can segment videos without the need for supervision, either implicit or explicit. Taken together, we argue that these approaches offer an elegant path to inherently explainable, open-domain visual understanding while removing the need for human supervision in the form of labels and/or captions. We show that the proposed approach advances the state of the art on complex benchmarks, handling data imbalance, complex semantics, and complex visual scenes without the need for vast amounts of domain-specific training data. Extensive experiments on several publicly available datasets show the efficacy of the proposed approaches. We show that the proposed approaches outperform weakly supervised and unsupervised baselines by up to 24% and achieve competitive segmentation results compared to fully supervised baselines.
The self-supervised approach for video segmentation complements the top-down inference of the pattern theory formalism with efficient bottom-up processing, resulting in an elegant formalism for open-domain visual understanding.
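The description names two technical pieces that can be made concrete with small, hedged sketches: contextualization of grounded concepts via ConceptNet, and self-supervised event segmentation of untrimmed video.

First, contextualization links concepts detected in an image or video through commonsense relations retrieved from ConceptNet, and the resulting weighted structure serves as the scene interpretation (formalized in the dissertation as pattern theory generators connected by bonds). As a rough illustration of the retrieval step only, the sketch below queries the public ConceptNet 5 REST API for relations between pairs of grounded concepts; the function names, the `min_weight` threshold, and the example concept list are illustrative assumptions, not the dissertation's implementation.

```python
# Illustrative sketch: collect pairwise ConceptNet relations between grounded
# concepts to seed an interpretation graph. Endpoint and field names follow the
# public ConceptNet 5 REST API; everything else here is a hypothetical stand-in.
from itertools import combinations

import requests

API = "http://api.conceptnet.io/query"


def conceptnet_edges(concept_a, concept_b, min_weight=1.0):
    """Return (relation label, weight) pairs linking two English concepts."""
    params = {"node": f"/c/en/{concept_a}", "other": f"/c/en/{concept_b}"}
    edges = requests.get(API, params=params, timeout=10).json().get("edges", [])
    return [(e["rel"]["label"], e["weight"])
            for e in edges if e.get("weight", 0) >= min_weight]


def contextualize(grounded_concepts):
    """Toy 'interpretation': concepts as nodes, commonsense relations as bonds."""
    interpretation = {}
    for a, b in combinations(grounded_concepts, 2):
        relations = conceptnet_edges(a, b)
        if relations:
            interpretation[(a, b)] = relations
    return interpretation


if __name__ == "__main__":
    # Concepts a visual recognizer might ground in a cooking scene.
    print(contextualize(["person", "knife", "cut", "vegetable"]))
```

Second, for self-supervised segmentation of untrimmed videos, one common strategy, consistent with the prediction-error framing used in self-supervised event segmentation but not taken from the dissertation's code, is to place event boundaries where the error of a learned frame-feature predictor spikes. The threshold rule and names below are assumptions for illustration.

```python
# Hedged sketch: mark event boundaries where a predictor's frame-level error
# exceeds a simple statistical threshold. A real system would compute `errors`
# from a learned model over video features; here the trace is synthetic.
import numpy as np


def boundaries_from_prediction_error(errors, z=1.5):
    """Return frame indices where prediction error first exceeds mean + z * std."""
    errors = np.asarray(errors, dtype=float)
    above = errors > errors.mean() + z * errors.std()
    # Keep only the first frame of each run above the threshold.
    return [i for i in range(len(errors)) if above[i] and (i == 0 or not above[i - 1])]


# Synthetic error trace with two spikes -> two detected event boundaries.
trace = [0.10] * 30 + [0.90, 1.10, 0.80] + [0.12] * 40 + [1.00, 0.95] + [0.10] * 25
print(boundaries_from_prediction_error(trace))  # -> [30, 73]
```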
author: Aakur, Sathyanarayanan Narasimhan
author_facet: Aakur, Sathyanarayanan Narasimhan
author_sort: Aakur, Sathyanarayanan Narasimhan
title: Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
title_short: Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
title_full: Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
title_fullStr: Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
title_full_unstemmed: Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation
title_sort: beyond labels and captions: contextualizing grounded semantics for explainable visual interpretation
publisher: Scholar Commons
publishDate: 2019
url: https://scholarcommons.usf.edu/etd/7718 https://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=8915&context=etd
work_keys_str_mv: AT aakursathyanarayanannarasimhan beyondlabelsandcaptionscontextualizinggroundedsemanticsforexplainablevisualinterpretation
_version_: 1719295179791269888
spelling: ndltd-USF-oai-scholarcommons.usf.edu-etd-8915 2019-11-22T10:12:27Z 2019-06-28T07:00:00Z text application/pdf https://scholarcommons.usf.edu/etd/7718 https://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=8915&context=etd Graduate Theses and Dissertations Scholar Commons Common sense knowledge Visual understanding ConceptNet Computer Sciences