Beyond Labels and Captions: Contextualizing Grounded Semantics for Explainable Visual Interpretation

One of the long-standing problems in artificial intelligence is the development of intelligent agents with complete visual understanding. Understanding entails recognition of scene attributes such as actors, objects and actions as well as reasoning about the common semantic structure that combines t...

Full description

Bibliographic Details
Main Author: Aakur, Sathyanarayanan Narasimhan
Format: Others
Published: Scholar Commons 2019
Subjects:
Online Access:https://scholarcommons.usf.edu/etd/7718
https://scholarcommons.usf.edu/cgi/viewcontent.cgi?article=8915&context=etd
Description
Summary:One of the long-standing problems in artificial intelligence is the development of intelligent agents with complete visual understanding. Understanding entails recognition of scene attributes such as actors, objects and actions as well as reasoning about the common semantic structure that combines these attributes into a coherent description. While significant milestones have been achieved in the field of computer vision, majority of the work has been concentrated on supervised visual recognition where complex visual representations are learned and a few discrete categories or labels are assigned to these representations. This implies a closed world where the underlying assumption is that all environments contain the same objects and events, which are in one-to-one correspondence with the ground evidence in the image. Hence, the learned knowledge is limited to the annotated training set. An open world, on the other hand, does not assume the distribution of semantics and requires generalization beyond the training annotations. Increasingly complex models require massive amounts of training data and offer little to no explainability due to the lack of transparency in the decision-making process. The strength of artificial intelligence systems to offer explanations for their decisions is central to building user confidence and structuring smart human-machine interactions. In this dissertation, we develop an inherently explainable approach for generating rich interpretations of visual scenes. We move towards an open world open-domain visual understanding by decoupling the ideas of recognition and reasoning. We integrate common sense knowledge from large knowledge bases such as ConceptNet and the representation learning capabilities of deep learning approaches in a pattern theory formalism to interpret a complex visual scene. To be specific, we first define and develop the idea of contextualization to model and establish complex semantic relationships among concepts grounded in visual data. The resulting semantic structures, called interpretations allow us to represent the visual scene in an intermediate representation that can then be used as the source of knowledge for various modes of expression such as labels, captions and even question answering. Second, we explore the inherent explainability of such visual interpretations and define key components for extending the notion of explainability to intelligent agents for visual recognition. Finally, we describe a self-supervised model for segmenting untrimmed videos into its constituent events. We show that this approach can segment videos without the need for supervision - neither implicit nor explicit. Combined, we argue that these approaches offer an elegant path to inherently explainable, open domain visual understanding while negating the need for human supervision in the form of labels and/or captions. We show that the proposed approach can advance the state-of-the-art results in complex benchmarks to handle data imbalance, complex semantics, and complex visual scenes without the need for vast amounts of domain-specific training data. Extensive experiments on several publicly available datasets show the efficacy of the proposed approaches. We show that the proposed approaches outperform weakly-supervised and unsupervised baselines by up to 24% and achieves competitive segmentation results compared to fully supervised baselines. The self-supervised approach for video segmentation complements this top-down inference with efficient bottom-up processing, resulting in an elegant formalism for open-domain visual understanding.