CV AI · Feb 15

GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang

arXiv:2602.14201v1 · 8.55 citations · h-index: 1

Originality: Highly original

AI Analysis

This paper addresses evidence-grounded understanding of remote sensing imagery for AI applications. It is an incremental improvement that contributes a novel method for a known bottleneck.

The paper tackles tool usage homogenization in zoom-enabled multimodal large language models for ultra-high-resolution remote sensing VQA, a failure mode in which existing methods do not acquire evidence effectively. It proposes GeoEyes, a staged training framework that achieves 54.23% accuracy on XLRS-Bench.

The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
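
As a rough illustration of the idea behind rewarding evidence gain in a GRPO-style setup, the sketch below scores a group of sampled zoom trajectories and standardizes each reward against its own group. It is not the paper's AdaZoom-GRPO reward: the `Rollout` fields, the `shaped_reward` formula, and the `gain_weight` and `zoom_cost` values are all hypothetical and chosen only to make the example concrete. Only the group-relative standardization follows the general GRPO recipe.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled zoom trajectory. All fields are illustrative assumptions."""
    correct: bool          # final answer matches the reference
    evidence_gain: float   # hypothetical score in [0, 1] for how much the zooms helped
    num_zooms: int         # number of zoom-in tool calls made


def shaped_reward(r, gain_weight=0.5, zoom_cost=0.05):
    """Hypothetical reward: correctness plus evidence gain, minus a small per-zoom cost."""
    return float(r.correct) + gain_weight * r.evidence_gain - zoom_cost * r.num_zooms


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# Four hypothetical rollouts sampled for the same question.
group = [
    Rollout(correct=True, evidence_gain=0.8, num_zooms=2),   # focused, useful zooms
    Rollout(correct=True, evidence_gain=0.1, num_zooms=6),   # correct, but many wasted zooms
    Rollout(correct=False, evidence_gain=0.0, num_zooms=0),  # no zooming, wrong answer
    Rollout(correct=False, evidence_gain=0.2, num_zooms=6),  # many zooms, still wrong
]
rewards = [shaped_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, the trajectory that answers correctly with a few informative zooms receives the largest advantage, while a correct answer reached through many uninformative zooms is ranked lower. That ordering is the kind of pressure that would discourage task-agnostic zoom patterns and encourage stopping once enough evidence is collected.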
