CV · Mar 11

Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

arXiv:2603.10463v1 · 88.72 citations · h-index: 17
Predicted impact: top 20% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses geolocation for embodied AI systems, offering a new paradigm for interactive reasoning, though it is incremental in applying existing reasoning methods to a new domain.

The paper tackles the problem of global image geolocation by introducing WanderBench, a benchmark with over 32K panoramas for interactive exploration, and GeoAoT, a framework that uses actionable reasoning to improve localization, achieving superior fine-grained accuracy and generalization in experiments with 19 large multimodal models.

Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Although advanced large multimodal models (LMMs) have demonstrated both capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce WanderBench, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose GeoAoT, a geolocation framework based on Action of Thought (AoT) that couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.
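
To make the idea of panoramas "organized as navigable graphs" with rotation and movement actions more concrete, here is a minimal Python sketch. It is not the WanderBench data format or the GeoAoT implementation: the names `PanoNode`, `Explorer`, `rotate`, and `move`, the heading-keyed neighbour map, and the coordinates are all hypothetical, chosen only to illustrate how an agent's state could change under the two action types the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class PanoNode:
    """One panorama in a hypothetical navigable graph (illustrative only)."""
    pano_id: str
    lat: float
    lon: float
    # Assumption: edges are keyed by compass heading in degrees and point
    # to the id of the neighbouring panorama in that direction.
    neighbors: dict[int, str] = field(default_factory=dict)


@dataclass
class Explorer:
    """Agent state: the current panorama and the heading the camera faces."""
    graph: dict[str, PanoNode]
    current: str
    heading: int = 0

    def rotate(self, degrees: int) -> int:
        """Turn the viewpoint in place; the heading wraps into [0, 360)."""
        self.heading = (self.heading + degrees) % 360
        return self.heading

    def move(self) -> bool:
        """Step to the neighbour along the current heading, if one exists."""
        target = self.graph[self.current].neighbors.get(self.heading)
        if target is None:
            return False  # no edge in this direction; the agent stays put
        self.current = target
        return True


# Made-up two-node graph: "b" lies east of "a".
graph = {
    "a": PanoNode("a", 48.8584, 2.2945, {90: "b"}),
    "b": PanoNode("b", 48.8584, 2.2960, {270: "a"}),
}
agent = Explorer(graph, current="a")
agent.rotate(90)      # face east
moved = agent.move()  # follow the edge from "a" to "b"
```

In a setup like this, an actionable plan such as "adjust the viewpoint, then approach the landmark" would reduce to a sequence of `rotate` and `move` calls, each followed by a fresh observation from the new state.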

