CV · Jul 15, 2025

Combining Transformers and CNNs for Efficient Object Detection in High-Resolution Satellite Imagery

arXiv:2507.11040v1 · 2 citations · h-index: 14 · 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
Originality: Incremental advance
AI Analysis

This work addresses the problem of efficient object detection in satellite imagery for remote sensing applications, representing a strong domain-specific advancement.

The paper tackles object detection in high-resolution satellite imagery by introducing GLOD, a transformer-first architecture that replaces CNN backbones with a Swin Transformer and incorporates novel upsampling and fusion blocks. It achieves 32.95% on the xView benchmark, outperforming state-of-the-art methods by 11.46%.

We present GLOD, a transformer-first architecture for object detection in high-resolution satellite imagery. GLOD replaces CNN backbones with a Swin Transformer for end-to-end feature extraction, combined with novel UpConvMixer blocks for robust upsampling and Fusion Blocks for multi-scale feature integration. Our approach achieves 32.95% on xView, outperforming SOTA methods by 11.46%. Key innovations include asymmetric fusion with CBAM attention and a multi-path head design capturing objects across scales. The architecture is optimized for satellite imagery challenges, leveraging spatial priors while maintaining computational efficiency.
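The abstract names CBAM attention as part of the fusion step but does not say how it is wired in. As a rough illustration only, the sketch below follows the generic CBAM recipe (channel attention followed by spatial attention) in plain NumPy. It is not GLOD's Fusion Block: the tensor sizes, the reduction ratio `r`, the 7x7 kernel, the random weights, and all function names are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). w1: (C//r, C), w2: (C, C//r). Returns (C, 1, 1) weights."""
    avg = x.mean(axis=(1, 2))  # global average pool per channel, shape (C,)
    mx = x.max(axis=(1, 2))    # global max pool per channel, shape (C,)

    def mlp(v):
        # Shared two-layer MLP with a ReLU bottleneck.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]

def spatial_attention(x, kernel):
    """x: (C, H, W). kernel: (2, k, k). Returns (1, H, W) weights."""
    # Pool across channels, then stack the two maps as a 2-channel image.
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = x.shape[1:]
    out = np.zeros((h, w))
    # Naive "same" cross-correlation; fine for a small illustration.
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None]

def cbam(x, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to one feature map."""
    x = x * channel_attention(x, w1, w2)
    return x * spatial_attention(x, kernel)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
C, H, W, r, k = 8, 16, 16, 4, 7
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, k, k)) * 0.1
y = cbam(x, w1, w2, kernel)
```

Both attention maps come out of a sigmoid, so each output value is the input value scaled by a factor strictly between 0 and 1; the module reweights features and leaves the tensor shape unchanged. In a trained network the weights would be learned, and the convolution would use a framework primitive instead of the explicit loops.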
