Inference Machines: Parsing Scenes via Iterated Predictions
Extracting a rich representation of an environment from visual sensor readings canbenefit many tasks in robotics, e.g., path planning, mapping, and object manipulation.While important progress has been made, it remains a difficult problem to effectivelyparse entire scenes, i.e., to recognize semanti...
Main Author: | |
---|---|
Format: | Others |
Published: |
Research Showcase @ CMU
2013
|
Subjects: | |
Online Access: | http://repository.cmu.edu/dissertations/305 http://repository.cmu.edu/cgi/viewcontent.cgi?article=1309&context=dissertations |
Summary: | Extracting a rich representation of an environment from visual sensor readings canbenefit many tasks in robotics, e.g., path planning, mapping, and object manipulation.While important progress has been made, it remains a difficult problem to effectivelyparse entire scenes, i.e., to recognize semantic objects, man-made structures, and landforms.This process requires not only recognizing individual entities but also understandingthe contextual relations among them.
The prevalent approach to encode such relationships is to use a joint probabilistic orenergy-based model which enables one to naturally write down these interactions. Unfortunately,performing exact inference over these expressive models is often intractableand instead we can only approximate the solutions. While there exists a set of sophisticatedapproximate inference techniques to choose from, the combination of learning andapproximate inference for these expressive models is still poorly understood in theoryand limited in practice. Furthermore, using approximate inference on any learned modeloften leads to suboptimal predictions due to the inherent approximations.
As we ultimately care about predicting the correct labeling of a scene, and notnecessarily learning a joint model of the data, this work proposes to instead view theapproximate inference process as a modular procedure that is directly trained in orderto produce a correct labeling of the scene. Inspired by early hierarchical models in thecomputer vision literature for scene parsing, the proposed inference procedure is structuredto incorporate both feature descriptors and contextual cues computed at multipleresolutions within the scene. We demonstrate that this inference machine frameworkfor parsing scenes via iterated predictions offers the best of both worlds: state-of-the-artclassification accuracy and computational efficiency when processing images and/orunorganized 3-D point clouds. Additionally, we address critical problems that arise inpractice when parsing scenes on board real-world systems: integrating data from multiplesensor modalities and efficiently processing data that is continuously streaming fromthe sensors. |
---|