Learning Language-vision Correspondences

Full description

Given an unstructured collection of captioned images of cluttered scenes featuring a variety of objects, our goal is to simultaneously learn the names and appearances of the objects. Only a small fraction of local features within any given image are associated with a particular caption word, and captions may contain irrelevant words not associated with any image object. We propose a novel algorithm that uses the repetition of feature neighborhoods across training images, together with a measure of correspondence with caption words, to learn meaningful feature configurations (representing named objects). We also introduce a graph-based appearance model that captures some of the structure of an object by encoding the spatial relationships among its local visual features. In an iterative procedure, we use language (the words) to drive a perceptual grouping process that assembles an appearance model for a named object. We also exploit co-occurrences among appearance models to learn hierarchical appearance models. Results of applying our method to three data sets under a variety of conditions demonstrate that, from complex, cluttered, real-world scenes with noisy captions, we can learn both the names and appearances of objects, resulting in a set of models invariant to translation, scale, orientation, occlusion, and minor changes in viewpoint or articulation. These named models, in turn, are used to automatically annotate new, uncaptioned images, thereby facilitating keyword-based image retrieval.
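The central mechanism described above is a measure of correspondence between a candidate configuration of local features and a caption word. The following minimal Python sketch illustrates one such measure (an F1-style co-occurrence score over the training set); the function name, data layout, and the particular score are illustrative assumptions, not the thesis's actual formulation.

    # Hypothetical sketch, not the thesis's actual measure: score how well
    # detections of a candidate feature configuration line up with a caption
    # word across a captioned training set.
    def correspondence_score(detections, captions, word):
        """detections: list of bools, one per training image (configuration found?).
        captions: list of sets of caption words, one per image."""
        has_word = [word in c for c in captions]
        both = sum(d and w for d, w in zip(detections, has_word))
        if both == 0:
            return 0.0
        precision = both / sum(detections)  # fraction of detections whose caption has the word
        recall = both / sum(has_word)       # fraction of word occurrences backed by a detection
        return 2 * precision * recall / (precision + recall)  # F1-style score

    # Toy usage: a configuration that fires exactly on the "mug" images scores 1.0.
    detections = [True, False, True, True, False]
    captions = [{"mug", "table"}, {"lamp"}, {"mug"}, {"mug", "book"}, {"table"}]
    print(correspondence_score(detections, captions, "mug"))  # -> 1.0

A score like this rewards configurations that appear when the word does and stay absent when it does not, which is the property the iterative grouping procedure needs in order to let language guide the assembly of appearance models.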

Bibliographic Details
Main Author: Jamieson, Michael
Other Authors: Dickinson, Sven; Stevenson, Suzanne
Language: English (en_ca)
Published: 2010
Subjects: image annotation; object recognition
Online Access: http://hdl.handle.net/1807/26192