CV · AI · May 29

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

arXiv:2605.31251 · 88.1 · Has Code

Predicted impact: top 23% in CV · last 90 days
Originality: Incremental advance

AI Analysis

This benchmark addresses the underexplored area of embodied geo-localization for multimodal large language models, providing a fine-grained evaluation framework.

This paper introduces ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization, evaluating models across single-view, panorama-view, and embodied-view settings. Evaluations of leading MLLMs reveal they can infer high-level geographic semantics but struggle with fine-grained perceptual operations, metric localization, and spatial consistency.

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/
