CV AI CL LG Jan 29

GeoRC: A Benchmark for Geolocation Reasoning Chains

Mohit Talreja, Joshua Diao, Jim Thannikary James, Radu Casapu, Tejas Santanam, Ethan Mendes, Alan Ritter, Wei Xu, James Hays

arXiv:2601.21278v12.81 citations h-index: 7

Originality: Incremental advance

AI Analysis

This paper addresses the interpretability and reliability of VLMs in geolocation tasks, which matters to both researchers and practitioners. It is incremental in that it focuses on benchmarking the problem rather than solving it.

The paper tackles the problem that Vision Language Models (VLMs) often produce inaccurate or hallucinated reasoning chains for geolocation predictions, even when the predictions themselves are correct. It does so by introducing the first benchmark for geolocation reasoning chains. The results show that while large closed-source VLMs rival human experts in prediction accuracy, they lag in producing auditable reasoning chains, and open-weight VLMs perform only slightly better than a baseline that hallucinates chains without visual information.

Vision Language Models (VLMs) are good at recognizing the global location of a photograph -- their geolocation prediction accuracy rivals the best human experts. But many VLMs are startlingly bad at explaining which image evidence led to their prediction, even when their location prediction is correct. The reasoning chains produced by VLMs frequently hallucinate scene attributes to support their location prediction (e.g. phantom writing, imagined infrastructure, misidentified flora). In this paper, we introduce the first benchmark for geolocation reasoning chains. We focus on the global location prediction task in the popular GeoGuessr game, which draws from Google Street View spanning more than 100 countries. We collaborate with expert GeoGuessr players, including the reigning world champion, to produce 800 ground-truth reasoning chains for 500 query scenes. These expert reasoning chains address hundreds of different discriminative visual attributes such as license plate shape, architecture, and soil properties, to name just a few. We evaluate LLM-as-a-judge and VLM-as-a-judge strategies for scoring VLM-generated reasoning chains against our expert reasoning chains and find that Qwen 3 LLM-as-a-judge correlates best with human scoring. Our benchmark reveals that while large, closed-source VLMs such as Gemini and GPT 5 rival human experts at predicting locations, they still lag behind human experts when it comes to producing auditable reasoning chains. Open-weight VLMs such as Llama and Qwen catastrophically fail on our benchmark -- they perform only slightly better than a baseline in which an LLM hallucinates a reasoning chain with oracle knowledge of the photo location but no visual information at all. We believe the gap between human experts and VLMs on this task points to VLM limitations at extracting fine-grained visual attributes from high-resolution images.
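
The abstract says candidate judges were compared by how well their scores correlate with human scoring, but it does not name the correlation measure. As a minimal sketch only, the following assumes Spearman rank correlation and picks the judge whose scores track the human scores most closely. The function names, judge names, and all scores are hypothetical illustrations, not the paper's method or data.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def best_judge(human_scores, judge_scores):
    """Return the name of the judge most rank-correlated with human scores."""
    return max(judge_scores, key=lambda name: spearman(human_scores, judge_scores[name]))


# Hypothetical toy scores for five reasoning chains (not from the paper).
human = [0.9, 0.2, 0.6, 0.4, 0.8]
judges = {
    "judge_a": [0.8, 0.1, 0.7, 0.3, 0.9],
    "judge_b": [0.5, 0.6, 0.4, 0.7, 0.3],
}
print(best_judge(human, judges))
```

A rank-based measure is used here because judge and human scores may sit on different scales. Other agreement measures, such as Pearson correlation or Kendall's tau, would be equally reasonable assumptions.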
