Summary: | Deep learning methods have substantially improved performance in many computer vision applications. However, these methods demand massive resources, including data collection, label annotation, and computation, and real-world applications often cannot supply them. These resource constraints limit the deployment of powerful deep models and result in degraded performance. Research topics that address resource limitations, such as few-shot and zero-shot recognition, have therefore drawn attention recently. In this thesis, we develop machine learning methods that reduce resource requirements while keeping prediction accuracy on par with resource-rich models. Specifically, we consider three settings: label-limited recognition, zero-shot detection, and reducing energy consumption at inference time in IoT systems.
We first propose a novel image encoding method that decomposes an image into a few semantic parts and represents each part with a compact vocabulary of a few concepts. Because the concepts learned by our model generalize well to novel objects, this encoding achieves competitive results in label-limited classification tasks such as few-shot/zero-shot recognition and unsupervised domain adaptation. The encoding is also strongly robust to adversarial image perturbations, and crowd-sourced evaluations indicate that it is interpretable by humans. Next, we propose a statistical model that represents the structural information of an object. Each object is described by its part locations and location-independent signatures, which together form a latent space on which a structural constraint is imposed. At inference time, the model produces the representation that maximizes the posterior probability. We show that the new representation achieves state-of-the-art performance for few-shot recognition on benchmark datasets.
We then study the problem of zero-shot detection. We propose an evaluation protocol and develop two algorithms to address the problem. The first algorithm integrates semantic attribute predictions into visual features to produce bounding boxes that carry both visual and semantic information. The second algorithm takes a data augmentation approach: a conditional variational auto-encoder first produces synthetic features for unseen classes by leveraging their semantic attributes, and the confidence predictor is then trained on the real data together with the synthetic features so that it predicts higher confidence scores for unseen objects. In empirical evaluations on complex datasets, both algorithms show significant improvement in the detection of unseen objects.
Finally, we present a novel learning framework that associates each edge device in an IoT system with a gating function. At inference time, the gating function can stop the device from transmitting redundant features to the central inference model for some instances. In our evaluations on real-world datasets, this framework significantly reduces energy cost by reducing the number of transmissions, with negligible accuracy degradation. === 2023-09-26T00:00:00Z
|