CV · Apr 22

Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

arXiv:2604.20429 · 79.5
Predicted impact: top 29% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses efficient and accurate cross-modal retrieval for remote sensing imagery. It is an incremental advance: it builds on prior methods, improving efficiency without sacrificing performance.

The paper targets the difficulty of achieving fine-grained cross-modal alignment and efficient retrieval at the same time in remote sensing image-text retrieval. It proposes a fast-then-fine two-stage framework that reaches competitive retrieval accuracy while significantly improving retrieval efficiency compared to existing methods.

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
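The recall-then-rerank decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the interaction block or the representations, so the softmax attention over region features below is a generic parameter-free stand-in, and every name (`recall`, `text_guided_score`, `fast_then_fine`, `temperature`) is hypothetical. It assumes each image has one precomputed global vector and a list of region vectors, and each query has one global vector and a list of token vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def recall(query_global, image_globals, k):
    """Stage 1 (fast): rank all images by similarity of global vectors.

    The image vectors do not depend on the query text, so they can be
    computed once offline. Returns the indices of the top-k images.
    """
    scored = [(cosine(query_global, g), i) for i, g in enumerate(image_globals)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def text_guided_score(query_tokens, regions, temperature=0.1):
    """Stage 2 (fine), assumed form: each text token attends over the image
    regions with a softmax (no learnable weights), and the token is compared
    with its attended region summary. Scores are averaged over tokens.
    """
    total = 0.0
    for t in query_tokens:
        sims = [cosine(t, r) for r in regions]
        m = max(sims)  # subtract the max for numerical stability
        w = [math.exp((s - m) / temperature) for s in sims]
        z = sum(w)
        pooled = [
            sum(wi * r[d] for wi, r in zip(w, regions)) / z
            for d in range(len(t))
        ]
        total += cosine(t, pooled)
    return total / len(query_tokens)


def fast_then_fine(query_global, query_tokens, image_globals, image_regions, k):
    # Only the k recalled candidates pay the cost of the fine-grained stage.
    candidates = recall(query_global, image_globals, k)
    return sorted(
        candidates,
        key=lambda i: -text_guided_score(query_tokens, image_regions[i]),
    )
```

The efficiency argument is visible in the structure: the expensive token-to-region interaction runs on `k` candidates rather than on the whole gallery, while the first stage reduces to a similarity search over vectors that can be indexed in advance.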
