Summary: | Deep learning models have achieved remarkable success in many computer vision tasks. However, they typically rely on large amounts of carefully labeled training data, and the annotation process is usually expensive, time-consuming, and even infeasible given the complexity of some tasks and the scarcity of expert knowledge. This reliance on high-quality human supervision has become the biggest bottleneck in scaling our models to the vast space of possible visual tasks under diverse real-world settings. This dissertation is on visual learning with limited supervision. Its scope falls mainly into two aspects: learning from data with weak forms of annotation and learning from multi-modal data pairs. Specifically, I will first present a guided attention learning framework that conducts semantic segmentation using mainly image-level labels, since such a weak form of annotation can be collected much more efficiently than pixel-level labels. Under mild assumptions, our framework can also be used as a plug-in to existing convolutional neural networks to improve their generalization performance. This is achieved by guiding the network to focus on the correct image regions when learning concepts from a limited set of training samples. Then, I will introduce models that can effectively learn from multi-modal data pairs without relying on dense annotations of visual semantic concepts. Our models incorporate relational reasoning into the visual representation learning process so that the learned representations can be better aligned with the supervision from the corresponding text descriptions. Finally, I will conclude the dissertation with a summary of observations and a discussion of potential future directions.--Author's abstract