Summary: | In the computer vision community, the general trend has been to capture and select discriminative features in order to yield significantly better performance. Recent advances in attention mechanism proposed several attention blocks to adaptively recalibrate the feature response. However, most of them overlooked the context information at a multi-scale level. In this paper, we propose a simple yet effective building block for ResNeXt-style backbones, namely discriminative local representation (DLR) module, which allows discriminative local representation learning for multi-scale feature information across multi-parallel branches. Our DLR module contains two sub-modules: channel selective module (CSM) and spatial selective module (SSM). Given an intermediate feature map, the CSM first selectively generates the channel-wise attention maps and recalibrates the response from different branches according to the weight vector calculated by softmax layer. And then, the SSM further captures the spatial discriminative information at different scales respectively and emphasizes the interdependent channel maps. Besides, we place a high-order item during the process of multi-branch fusion and residual connection to enhance the intensity of structure nonlinearity. Various DLR modules can be stacked to a deep convolution network named DLRNet. To validate our DLRNet, we conduct comprehensive experiments on classification benchmarks (i.e. CIFAR10, CIFAR100 and ImageNet-1K), as well as two publicly available fine-grained datasets (i.e. CUB-200-2011 and Stanford Dogs). The experiments show consistent improvement gains over previous baseline models with reasonable overhead, and demonstrate the capability of our proposed method for discriminative local representation.
|