CV, IR | Dec 12, 2025

VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

arXiv:2512.11490v1 | h-index: 3
Originality: Incremental advance
AI Analysis

VLM2GeoVec addresses the need for cohesive multimodal analysis in remote sensing by unifying scalable retrieval with region-level spatial reasoning. The advance appears incremental, since it builds on existing vision-language model paradigms.

The paper tackles the problem of fragmented multimodal approaches in remote sensing by proposing VLM2GeoVec, a single-encoder vision-language model that embeds interleaved inputs in a unified vector space, achieving significant improvements such as 26.6% P@1 on region-caption retrieval (+25 percentage points vs. baselines) and 17.8% P@1 on semantic geo-localization retrieval (over 3× prior best).
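The P@1 (precision at 1) figures quoted above count how often the top-ranked candidate for a query is the correct one. The sketch below illustrates that metric on toy data, assuming cosine similarity as the ranking score and exactly one correct candidate per query; the helper names (`cosine`, `precision_at_1`) and all vectors are hypothetical and are not taken from the paper or its benchmark.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def precision_at_1(queries, candidates, gold):
    """Fraction of queries whose top-ranked candidate is the gold one.

    queries, candidates: lists of embedding vectors (toy data here).
    gold[i]: index of the single correct candidate for query i.
    """
    hits = 0
    for i, q in enumerate(queries):
        scores = [cosine(q, c) for c in candidates]
        # Index of the highest-scoring candidate for this query.
        top = max(range(len(candidates)), key=scores.__getitem__)
        hits += int(top == gold[i])
    return hits / len(queries)

# Toy embeddings, made up for illustration only.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.2], [0.1, 1.0]]
gold = [0, 1, 2, 1]

p_at_1 = precision_at_1(queries, candidates, gold)
print(f"P@1 = {p_at_1:.2%}")
```

In this toy setup the third query ranks candidate 0 above its gold candidate 2, so three of the four queries are hits.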

Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, VLM2GeoVec achieves $\textbf{26.6\%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5\%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8\%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
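The abstract says only that the encoder is "trained contrastively" and does not specify the loss. As a rough illustration of what a contrastive objective over paired embeddings can look like, the sketch below computes a one-directional InfoNCE loss with in-batch negatives. The choice of InfoNCE, the temperature of 0.07, the function names, and the toy vectors are all assumptions made for this example, not details confirmed by the paper.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def info_nce(query_embs, target_embs, temperature=0.07):
    """Mean InfoNCE loss over a batch (assumed loss, not the paper's).

    Query i's positive is target i; every other target in the batch
    acts as a negative.
    """
    q = [normalize(v) for v in query_embs]
    t = [normalize(v) for v in target_embs]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over all targets in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching target as the correct class.
        total += log_denominator - logits[i]
    return total / len(q)

# Toy batch: queries paired with well-aligned targets, then with the
# same targets in swapped order.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.1], [0.1, 1.0]])
shuffled = info_nce(queries, [[0.1, 1.0], [1.0, 0.1]])
print(f"aligned loss:  {aligned:.6f}")
print(f"shuffled loss: {shuffled:.6f}")
```

The loss is near zero when each query sits closest to its own target and grows when the pairing is broken, which is the signal that pulls matching inputs together in a shared vector space.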
