Can Large Vision Language Models Read Maps Like a Human?
This work addresses the need for better benchmarks of LVLMs' spatial reasoning in navigation, though the contribution is incremental: it focuses on dataset creation and evaluation rather than model improvement.
The paper tackles the problem of evaluating large vision-language models (LVLMs) on human-like map reading for outdoor navigation by introducing MapBench, a dataset of over 1600 pixel-based map path-finding problems, and finds that state-of-the-art LVLMs struggle significantly with spatial reasoning and structured decision-making in this task.
In this paper, we introduce MapBench, the first dataset specifically designed for human-readable, pixel-based outdoor map navigation, curated from complex path-finding scenarios. MapBench comprises over 1600 pixel-space map path-finding problems from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure for converting to and from natural language and for evaluating LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all code and the dataset at https://github.com/taco-group/MapBench.
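To make the role of a scene graph in this kind of evaluation concrete, the following is a toy sketch only. The abstract does not specify the MSSG's actual structure or the MapBench scoring procedure, so everything here is an assumption for illustration: the landmark names, the pixel coordinates, the edge list, and the helper functions `build_graph`, `shortest_path`, and `is_valid_route` are all hypothetical and are not taken from the MapBench code.

```python
import heapq
import math

# Hypothetical landmarks with pixel coordinates (x, y) on a map image.
LANDMARKS = {
    "Gate": (0, 0),
    "Fountain": (30, 40),
    "Cafe": (60, 40),
    "Museum": (60, 80),
}

# Hypothetical walkable connections between landmarks (undirected).
EDGES = [
    ("Gate", "Fountain"),
    ("Fountain", "Cafe"),
    ("Cafe", "Museum"),
    ("Fountain", "Museum"),
]


def build_graph(landmarks, edges):
    """Build an adjacency map weighted by Euclidean pixel distance."""
    graph = {name: {} for name in landmarks}
    for a, b in edges:
        weight = math.dist(landmarks[a], landmarks[b])
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (path, length) or (None, inf)."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path, cost
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return None, math.inf


def is_valid_route(graph, route, start, goal):
    """Check that a landmark sequence (e.g. parsed from model output)
    starts and ends correctly and only follows existing edges."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph[a] for a, b in zip(route, route[1:]))


graph = build_graph(LANDMARKS, EDGES)
reference_path, reference_length = shortest_path(graph, "Gate", "Museum")
print(reference_path, reference_length)
```

Under these assumptions, a model's instructions would first be mapped to a sequence of landmarks, then checked for connectivity with `is_valid_route` and compared against the reference route length. How MapBench itself performs this mapping and scoring is not stated in the abstract.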