Summary: | One of the most important tasks in modern computer vision, with a vast range of applications, is finding correspondences between local patches extracted from different views of a physical scene. In this thesis, we investigate three main axes of this problem. We first provide a critical review of prior work on methods for extracting local image descriptors. Next, we show that the intrinsic visual characteristics of a patch can fundamentally alter its matching process, and we demonstrate how to exploit this phenomenon to improve matching performance. One of the main contributions of this thesis is a novel approach to describing and matching image patches. We introduce a per-patch adaptive method that generates feature descriptors based on simple binary tests, yet matches the performance of methods of significantly higher complexity. We also demonstrate that our technique can be successfully generalised to other descriptors, showing its potential for broader applications. We then propose novel methods to learn compact and efficient patch representations using convolutional neural networks, and show that commonly used approaches such as architectural expansions or hard negative mining are not essential to the success of such methods. Our convolutional descriptors outperform state-of-the-art approaches at a fraction of the computational cost. Lastly, we demonstrate that much of the work in this area suffers from non-reproducibility and inconsistent evaluation results. To address this, we introduce a novel dataset, accompanied by improved protocols and benchmarks, that allows for reproducible results. More importantly, the scale of our dataset makes it possible to learn local feature descriptors from real-world data, something that has not been feasible so far due to the lack of such data. This will enable improved results and new experiments, especially in the context of deep learning and convolutional neural networks.