Summary: | The ultimate aim of computer vision research is to design a system that interprets its surrounding environment as effortlessly as humans do. However, the current state of technology is far from achieving this goal. This thesis describes the components of a computer vision system designed for interpreting man-made scenes, in particular images of buildings. The flow of information in the proposed system is bottom-up, i.e., the image is first segmented into its meaningful components and the resulting regions are then labelled using a contextual classifier. Starting from simple observations concerning the human visual system and the Gestalt laws of human perception, such as the law of “good (simple) shape” and “perceptual grouping”, a blob detector is developed that identifies components in a 2D image. These components are convex regions of interest, with interest defined as significant gradient magnitude content. An eye-tracking experiment shows that the regions identified by the blob detector correlate significantly with the regions that attract viewers' attention. Having identified these blobs, it is postulated that a blob represents an object that is linguistically identified by its own semantic name. In other words, a blob may contain a window, a door, or a chimney in a building. These regions are used to identify and segment higher-order structures in a building, such as facades and window arrays, as well as environmental regions such as sky and ground. Because the unary features of buildings are inconsistent, a contextual learning algorithm is used to classify the segmented regions. The model learns spatial and topological relationships between different objects from a set of hand-labelled data and utilises this information in a Markov random field (MRF) to achieve consistent labellings of new scenes.
|