CV

Jul 15, 2025

Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

Nicolas Drapier, Aladine Chetouani, Aurélien Chateigner

arXiv:2507.11040v16.22 citations

h-index: 14

2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Incremental advance

AI Analysis

This work addresses the problem of efficient object detection in satellite imagery for remote sensing applications, representing a strong domain-specific advancement.

The paper tackles object detection in high-resolution satellite imagery by introducing GLOD, a transformer-first architecture that replaces CNN backbones with a Swin Transformer and incorporates novel upsampling and fusion blocks. It achieves 32.95% on the xView benchmark, outperforming state-of-the-art methods by 11.46%.

We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
