GeoChain: Multimodal Chain-of-Thought for Geographic Reasoning
GeoChain gives researchers and developers a diagnostic tool for improving geographic reasoning in AI models. The contribution is incremental, since it focuses on benchmarking rather than proposing new methods.
The paper tackles the problem of evaluating step-by-step geographic reasoning in multimodal large language models by introducing GeoChain, a large-scale benchmark with 1.46 million images and over 30 million Q&A pairs. It finds that models such as GPT-4.1 and Claude 3.7 struggle with visual grounding and accurate localization as reasoning complexity increases.
This paper introduces GeoChain, a large-scale benchmark for evaluating step-by-step geographic reasoning in multimodal large language models (MLLMs). Leveraging 1.46 million Mapillary street-level images, GeoChain pairs each image with a 21-step chain-of-thought (CoT) question sequence (over 30 million Q&A pairs). These sequences guide models from coarse attributes to fine-grained localization across four reasoning categories - visual, spatial, cultural, and precise geolocation - annotated by difficulty. Images are also enriched with semantic segmentation (150 classes) and a visual locatability score. Our benchmarking of contemporary MLLMs (GPT-4.1 variants, Claude 3.7, Gemini 2.5 variants) on a diverse 2,088-image subset reveals consistent challenges: models frequently exhibit weaknesses in visual grounding, display erratic reasoning, and struggle to achieve accurate localization, especially as the reasoning complexity escalates. GeoChain offers a robust diagnostic methodology, critical for fostering significant advancements in complex geographic reasoning within MLLMs.
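The headline counts in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the `ChainStep` record and its field names are hypothetical and do not reflect the released dataset schema, the image count uses the rounded figure of 1.46 million, and the subset total assumes every one of the 2,088 images is asked all 21 questions.

```python
from dataclasses import dataclass

# Figures taken from the abstract (the image count is the rounded "1.46 million").
NUM_IMAGES = 1_460_000
STEPS_PER_IMAGE = 21
SUBSET_IMAGES = 2_088

# The four reasoning categories named in the abstract.
CATEGORIES = ("visual", "spatial", "cultural", "precise geolocation")


@dataclass
class ChainStep:
    """Hypothetical record for one question in a 21-step chain (not the real schema)."""
    index: int        # position in the chain, 1..21
    category: str     # one of CATEGORIES
    difficulty: str   # assumed free-form difficulty label
    question: str


def total_qa_pairs(num_images: int, steps_per_image: int) -> int:
    """Number of Q&A pairs if every image receives the full question chain."""
    return num_images * steps_per_image


full_total = total_qa_pairs(NUM_IMAGES, STEPS_PER_IMAGE)
subset_total = total_qa_pairs(SUBSET_IMAGES, STEPS_PER_IMAGE)

print(full_total)    # 30660000, consistent with "over 30 million Q&A pairs"
print(subset_total)  # 43848
```

Under these assumptions, the full benchmark holds roughly 30.66 million Q&A pairs, and a complete pass over the evaluation subset amounts to 43,848 questions per model.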