From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
For researchers evaluating multimodal models, this work reveals that high performance on visual planning tasks can be misleading, as models rely on token-level search rather than human-like spatial reasoning.
The paper introduces MazeBench, a benchmark of 110 procedurally generated maze images, and evaluates 16 multimodal model configurations. High accuracy (e.g., GPT-5.4 at 91%) comes from brute-force token-level search rather than genuine visual planning: models translate images into text grids and enumerate paths, consuming up to 22,818 tokens per solve.
How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce \textsc{MazeBench}, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91\% and Gemini 3.1 Pro 79\%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710--22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2--12\%; on 20$\times$20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6\% on images to 80\% when given the correct grid, isolating weak visual extraction from downstream search. Even when explicitly instructed not to construct a grid or perform graph search, models revert to the same enumeration strategy. \textsc{MazeBench} therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.
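To make the second stage of that strategy concrete, here is a minimal Python sketch of breadth-first search over a text grid, the kind of search the traces are described as spelling out in prose. The grid encoding (`#` for wall, `.` for open cell, `S` for start, `E` for exit), the function name `bfs_shortest_path`, and the sample maze are assumptions made for illustration; they are not MazeBench's actual format or any model's actual procedure.

```python
from collections import deque


def bfs_shortest_path(grid):
    """Return a shortest S-to-E path as a list of (row, col) cells, or None.

    `grid` is a list of strings using a hypothetical encoding:
    '#' = wall, '.' = open, 'S' = start, 'E' = exit.
    """
    start = goal = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "E":
                goal = (r, c)
    if start is None or goal is None:
        raise ValueError("grid must contain both 'S' and 'E'")

    # parent doubles as the visited set and lets us rebuild the path.
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        # Expand the four orthogonal neighbours (up, down, left, right).
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # exit is unreachable from the start


# Hypothetical 3x3 maze, standing in for the output of image-to-grid translation.
maze = [
    "S.#",
    ".##",
    "..E",
]
path = bfs_shortest_path(maze)
print(path)
```

Each dequeued cell and each neighbour check corresponds to a step a model would have to write out as tokens, which suggests why the token cost grows quickly with maze size. The sketch also assumes the grid is already correct; an error in the image-to-grid stage would propagate into the search regardless of how the search is carried out.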