Summary: | The classification performance on aerial scenes relies heavily on the discriminative power of the feature representation learned from high-spatial-resolution remotely sensed imagery. Convolutional neural networks (CNNs) have recently been applied to adaptively learn image features at different levels of abstraction, rather than relying on handcrafted features, and have achieved state-of-the-art performance. However, most of these networks focus on multi-stage global feature learning and neglect local information, which plays an important role in scene recognition. To address this issue, a novel end-to-end global-local attention network (GLANet) is proposed to capture both global and local information for aerial scene classification. The fully connected (FC) layers of VGGNet are replaced by a global attention (GA) branch and a local attention (LA) branch; the former learns global information while the latter learns local semantic information via attention mechanisms. During training, the label of each input image is predicted with softmax from the local features, the global features, and their concatenation. Based on these predictions, two auxiliary loss functions are computed and imposed on the network to strengthen supervision during learning. Experimental results on three challenging large-scale scene datasets demonstrate the effectiveness of the proposed global-local attention network.
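
The sketch below is a minimal PyTorch illustration of the design the abstract outlines, under stated assumptions: the internals of the two branches (SE-style channel attention for GA, a 1x1-conv spatial mask for LA), the auxiliary loss weight `aux_weight`, and all class and function names are hypothetical, since the abstract does not specify them. Only the overall structure follows the text: a VGG convolutional backbone with its FC layers removed, a global and a local attention branch, softmax classifiers on the local, global, and concatenated features, and two auxiliary losses added to the main one.

```python
# Minimal GLANet-style sketch (assumed branch designs; not the paper's exact code).
import torch
import torch.nn as nn
import torchvision.models as models

class GlobalAttention(nn.Module):
    """Global branch: SE-style channel attention over the conv feature map (assumption)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                                # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                  # global context -> channel weights
        return (x * w[:, :, None, None]).mean(dim=(2, 3))  # pooled global descriptor (B, C)

class LocalAttention(nn.Module):
    """Local branch: 1x1-conv spatial attention mask over feature locations (assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        a = torch.sigmoid(self.conv(x))                  # (B, 1, H, W) spatial weights
        return (x * a).mean(dim=(2, 3))                  # attended local descriptor (B, C)

class GLANet(nn.Module):
    """VGG conv stages + GA/LA branches; FC layers of VGGNet are dropped."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = models.vgg16(weights=None).features  # conv feature extractor
        c = 512                                          # VGG-16 final conv channels
        self.ga, self.la = GlobalAttention(c), LocalAttention(c)
        self.cls_g = nn.Linear(c, num_classes)           # auxiliary global classifier
        self.cls_l = nn.Linear(c, num_classes)           # auxiliary local classifier
        self.cls = nn.Linear(2 * c, num_classes)         # main classifier on concat

    def forward(self, x):
        f = self.features(x)
        g, l = self.ga(f), self.la(f)
        return self.cls(torch.cat([g, l], dim=1)), self.cls_g(g), self.cls_l(l)

def glanet_loss(outputs, labels, aux_weight=0.5):
    """Main cross-entropy plus the two auxiliary losses; aux_weight is an assumption."""
    main, g, l = outputs
    ce = nn.functional.cross_entropy
    return ce(main, labels) + aux_weight * (ce(g, labels) + ce(l, labels))
```

For example, `GLANet(num_classes=30)` applied to a batch of 224x224 RGB images returns the three logit tensors, and `glanet_loss` combines them into a single training objective, mirroring the joint supervision the abstract describes.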