Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery
This work addresses efficient object detection in satellite imagery for remote sensing applications and is positioned as a domain-specific advance.
The paper tackles object detection in high-resolution satellite imagery by introducing GLOD, a transformer-first architecture that replaces CNN backbones with a Swin Transformer and adds UpConvMixer blocks for upsampling and Fusion Blocks for multi-scale feature integration. It reports 32.95% on the xView benchmark, 11.46% above prior state-of-the-art methods.
We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming state-of-the-art methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design that captures objects across scales. The architecture is optimized for the challenges of satellite imagery, leveraging spatial priors while maintaining computational efficiency.
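To make the idea of attention-refined multi-scale fusion concrete, the following is a minimal NumPy sketch, not GLOD's actual Fusion Block. The abstract does not specify the block's internals, so everything here is an assumption: the function names (`cbam`, `fuse`), the fusion rule (nearest-neighbour upsampling of the coarse map followed by addition), the toy tensor sizes, and the random weights are all hypothetical. Only the CBAM structure itself (channel attention from average- and max-pooled descriptors through a shared MLP, then spatial attention from a convolution over channel-pooled maps) follows the commonly described form of that module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared two-layer MLP scores avg- and max-pooled descriptors."""
    avg = x.mean(axis=(1, 2))                      # (C,)
    mx = x.max(axis=(1, 2))                        # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # w1: (C//r, C), w2: (C, C//r)
    return sigmoid(mlp(avg) + mlp(mx))             # (C,) weights in (0, 1)

def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, k, k), applied with zero 'same' padding."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)                                   # (H, W) weights in (0, 1)

def cbam(x, w1, w2, kernel):
    """Channel attention first, then spatial attention on the reweighted features."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]

def fuse(fine, coarse, w1, w2, kernel):
    """Hypothetical fusion: upsample coarse by 2 (nearest neighbour), add, refine with CBAM."""
    up = coarse.repeat(2, axis=1).repeat(2, axis=2)
    return cbam(fine + up, w1, w2, kernel)

# Toy inputs: assumed sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
C, r, k = 8, 4, 7
fine = rng.standard_normal((C, 16, 16))     # higher-resolution feature map
coarse = rng.standard_normal((C, 8, 8))     # lower-resolution feature map
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k, k)) * 0.1

fused = fuse(fine, coarse, w1, w2, kernel)
print(fused.shape)  # (8, 16, 16)
```

Because both attention maps lie in (0, 1), the refined output can only attenuate the summed features, never amplify them. A real implementation would use learned weights inside a deep-learning framework and would likely differ in how the two scales are combined.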