Perceiving 3D structure and recognising objects and their properties around us is central to our understanding of the world. For example, when we drive a car from home to the workplace, we constantly perceive 3D structure, recognise objects and their properties, and understand their functional attributes so as to interact with the environment. Such capabilities permit free and accurate movement in unknown environments and may seem effortless for humans; for computer systems relying on artificial vision, however, they are not. Researchers from philosophy to neuroscience, and from mathematics to computer science, have therefore devoted considerable effort to understanding the principles underlying a vision system that could see as well as we do. Such understanding of (sequences of) images is commonly known as scene understanding. It involves solving three classical computer vision problems: recognition, reorganisation and reconstruction. In this dissertation, I focus on some of these problems and propose methods for solving them. The work can be divided into three main parts.
In the first part, I show how recognition and reorganisation can be improved by incorporating prior information such as context. Specifically, I propose novel algorithms that efficiently incorporate higher-order information, such as context and label consistency over large regions, into MRF models containing only unary and pairwise terms. Inference in the MRF is performed using a filter-based mean-field approach. I demonstrate these techniques on joint object and stereo labelling problems, as well as on object class segmentation, showing in addition, for joint object-stereo labelling, how the method provides an efficient approach to inference in product label spaces.
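To illustrate the mean-field idea named above, the following is a minimal sketch of naive mean-field inference for a pairwise MRF on a simple chain of nodes. The function names, the chain topology and the Potts compatibility matrix are illustrative assumptions only; the filter-based variant developed in the thesis instead evaluates the message-passing step for densely connected pixels via fast bilateral filtering.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax with max-subtraction for numerical stability.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, pairwise, n_iters=10):
    """Naive mean-field inference for a pairwise MRF on a chain.

    unary    -- (N, L) unary potentials (costs) per node and label
    pairwise -- (L, L) label-compatibility costs shared by all edges
    Returns approximate per-node marginals Q of shape (N, L).
    """
    Q = softmax(-unary)                      # initialise from the unaries
    for _ in range(n_iters):
        msg = np.zeros_like(unary)
        msg[1:] += Q[:-1] @ pairwise         # message from left neighbour
        msg[:-1] += Q[1:] @ pairwise.T       # message from right neighbour
        Q = softmax(-(unary + msg))          # coordinate-ascent update
    return Q

# Three nodes, two labels; a Potts pairwise term discourages label changes.
unary = np.array([[0.0, 2.0], [0.1, 1.0], [2.0, 0.0]])
potts = np.array([[0.0, 1.0], [1.0, 0.0]])
Q = mean_field(unary, potts)
```

Each iteration updates every node's label distribution given its neighbours' current distributions; in the dense-CRF setting the neighbour sum ranges over all pixels, which is why the filtering trick is needed for efficiency.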
In the second part, I propose methods that combine the benefits of reconstruction, recognition and reorganisation to solve scene understanding problems. First, I propose robust real-time systems that reconstruct dense 3D models of environments on the fly and associate them with object labels. This approach works for both indoor and outdoor scenes and scales to environments of any size. Next, I propose an algorithm for recovering intrinsic scene properties such as shape, reflectance and illumination from a single image, along with separately estimating the object and attribute segmentations. I formulate this joint estimation problem in an energy minimisation framework that captures the correlations between intrinsic properties (reflectance, shape, illumination), objects (table, tv-monitor) and materials (wooden, plastic) in a given scene. Finally, I design an efficient filter-based mean-field algorithm that jointly estimates human segmentation, pose and depth from a pair of stereo images, so as to capture the relationships between these three problems.
In the third part, I show how human interaction can help improve the visual recognition task. I propose an interactive 3D labelling and segmentation system that aims to make acquiring segmented 3D models fast, simple and user-friendly.
As the user carries a body-worn depth camera, the environment is reconstructed using standard techniques. The user can reach out and touch surfaces in the world, and provide object category labels through voice commands. These user-provided data are used to learn random-forest-based object models on the fly.
When the user then encounters a previously unobserved and unlabelled region of space, the forest predicts object labels for each voxel, and volumetric mean-field inference smooths the final output. I demonstrate compelling results on several sequences, showing that the learned models generalise to unseen regions of the world.
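To make the per-voxel prediction step concrete, here is a hedged sketch of random-forest voxel labelling using scikit-learn. The 6-D feature vectors and binary labels are invented placeholders standing in for the colour/geometry descriptors and user-provided touch labels described above; the thesis's own on-the-fly forest implementation will differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical 6-D per-voxel features (e.g. colour plus local geometry)
# for voxels the user has touched and named via voice commands.
X_labelled = rng.normal(size=(200, 6))
y_labelled = (X_labelled[:, 0] > 0).astype(int)   # toy binary object labels

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_labelled, y_labelled)

# For voxels in a newly observed region, the forest yields per-voxel label
# distributions; these marginals would then be smoothed by the volumetric
# mean-field inference step before the final labelling is rendered.
X_new = rng.normal(size=(10, 6))
voxel_probs = forest.predict_proba(X_new)         # shape (10, 2)
```

The forest's soft class probabilities, rather than hard labels, are what make the subsequent mean-field smoothing meaningful: the inference can trade off forest confidence against spatial consistency.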
Department of Computing and Communication Technologies, Faculty of Technology, Design and Environment
Year available: 2014
Published by Oxford Brookes University. All rights reserved.
RADAR: Research Archive and Digital Asset Repository