CV · Jun 1

Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agentic Vision-Language Models

arXiv:2606.02459 · 83.7 · Has Code
AI Analysis

For researchers working on spatial reasoning in VLMs, this work offers an agentic pipeline that substantially improves accuracy on the MindCube benchmark.

The paper tackles spatial reasoning in Vision-Language Models by introducing a dynamic cognitive map and Spatial Assertion Codes, which together supply dense rewards. It reaches 80.5% overall accuracy on the MindCube benchmark and, on the Rotation subset, outperforms the best current method by 29.5 accuracy points (a 53.2% relative improvement).

Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which limits their use in real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by how pigeons build and exploit cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a dynamic cognitive map that parameterizes the scene layout as object positions and orientations and serves as persistent memory for new observations. Second, we propose Spatial Assertion Codes (SAC), Python expressions that programmatically describe spatial relationships. Working together with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with 80.5% overall accuracy, outperforming the best current method by 29.5 accuracy points (a relative improvement of 53.2%) on the challenging Rotation subset. Our code and data are open-sourced at https://github.com/dw-dengwei/active-spatial-reasoning.git.
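The interplay between the cognitive map and SAC can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the map format or the SAC vocabulary, so the `CognitiveMap` class, the `left_of` and `in_front_of` predicates, the 2D pose convention, and the reward defined as the fraction of assertions that hold are all assumptions made here for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading_deg: float = 0.0  # facing direction, counterclockwise from the +x axis


class CognitiveMap:
    """Hypothetical persistent memory mapping object names to 2D poses."""

    def __init__(self):
        self.objects = {}

    def observe(self, name, x, y, heading_deg=0.0):
        # A new observation adds an object or overwrites its previous pose.
        self.objects[name] = Pose(x, y, heading_deg)

    def left_of(self, a, b, viewer):
        # True if `a` lies further along the viewer's left axis than `b`.
        v = self.objects[viewer]
        t = math.radians(v.heading_deg)
        left_axis = (-math.sin(t), math.cos(t))

        def lateral(name):
            o = self.objects[name]
            return (o.x - v.x) * left_axis[0] + (o.y - v.y) * left_axis[1]

        return lateral(a) > lateral(b)

    def in_front_of(self, a, b):
        # True if `a` lies in the half-plane that `b` is facing.
        oa, ob = self.objects[a], self.objects[b]
        t = math.radians(ob.heading_deg)
        return (oa.x - ob.x) * math.cos(t) + (oa.y - ob.y) * math.sin(t) > 0


def check_assertion(cmap, expr):
    # Evaluate one assertion (a Python expression) against the map.
    # Malformed expressions or unknown objects count as failed assertions.
    env = {"left_of": cmap.left_of, "in_front_of": cmap.in_front_of}
    try:
        return bool(eval(expr, {"__builtins__": {}}, env))
    except Exception:
        return False


def dense_reward(cmap, assertions):
    # Assumed reward shape: fraction of intermediate assertions that hold.
    if not assertions:
        return 0.0
    return sum(check_assertion(cmap, e) for e in assertions) / len(assertions)


cmap = CognitiveMap()
cmap.observe("viewer", 0.0, 0.0, heading_deg=90.0)  # viewer faces +y
cmap.observe("chair", -1.0, 2.0)
cmap.observe("table", 1.0, 2.0)

steps = [
    "left_of('chair', 'table', 'viewer')",
    "in_front_of('chair', 'viewer')",
    "left_of('table', 'chair', 'viewer')",
]
print(dense_reward(cmap, steps))
```

Because each intermediate step is scored separately, a reasoning trace that gets two of three relations right earns partial credit instead of the all-or-nothing signal a final-answer reward would give. Re-observing the viewer with a different heading changes which assertions hold, which is the kind of situation the Rotation subset probes. Note that `eval` on model-generated text is unsafe even with builtins removed; a real system would need a proper sandbox or a restricted expression parser.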
