CV · Jan 29

RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

arXiv:2601.21634v1
h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of spatial reasoning in remote sensing, which matters for applications such as environmental monitoring. The advance is incremental because it builds on existing multimodal large language models.

The paper tackles remote sensing visual grounding, in which models localize objects in aerial imagery from language descriptions. It proposes RSGround-R1, a reasoning-guided post-training framework that enhances spatial understanding, and reports superior performance on RSVG benchmarks.

Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. To leverage this unique feature, we propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding. Specifically, we first introduce Chain-of-Thought Supervised Fine-Tuning (CoT-SFT) using synthetically generated RSVG reasoning data to establish explicit position awareness. Reinforcement Fine-Tuning (RFT) is then applied, augmented by our newly designed positional reward that provides continuous and distance-aware guidance toward accurate localization. Moreover, to mitigate incoherent localization behaviors across rollouts, we introduce a spatial consistency guided optimization scheme that dynamically adjusts policy updates based on their spatial coherence, ensuring stable and robust convergence. Extensive experiments on RSVG benchmarks demonstrate superior performance and generalization of our model.
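
The abstract does not give the formula for the positional reward, so the following is only an illustrative sketch of what "continuous and distance-aware" could mean, not the authors' definition. It assumes boxes in `(x1, y1, x2, y2)` pixel format and a hypothetical reward that decays linearly with the distance between box centers, normalized by the image diagonal. The function names `iou` and `positional_reward` are made up for this example. The contrast with IoU shows the motivation: IoU is zero for every non-overlapping prediction, so it cannot tell a near miss from a far one.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center(box: Box) -> Tuple[float, float]:
    """Center point of a box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def positional_reward(pred: Box, target: Box, image_size: Tuple[float, float]) -> float:
    """Hypothetical distance-aware reward in [0, 1] (an assumption, not the paper's formula).

    Returns 1.0 when the box centers coincide and decreases linearly with
    the center distance, normalized by the image diagonal.
    """
    width, height = image_size
    (px, py), (tx, ty) = center(pred), center(target)
    distance = math.hypot(px - tx, py - ty)
    diagonal = math.hypot(width, height)
    return max(0.0, 1.0 - distance / diagonal)


if __name__ == "__main__":
    target = (0.0, 0.0, 10.0, 10.0)
    near_miss = (20.0, 20.0, 30.0, 30.0)
    far_miss = (60.0, 60.0, 70.0, 70.0)
    size = (100.0, 100.0)
    for name, box in [("near", near_miss), ("far", far_miss)]:
        print(name, "IoU:", iou(box, target), "positional:", positional_reward(box, target, size))
```

In this sketch both misses receive the same IoU, while the positional term ranks the nearer prediction higher, which is the kind of graded signal a reinforcement fine-tuning stage can use before predictions start to overlap the target.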

