AI · CV · May 22, 2025

Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning

arXiv:2505.16579v1 · 3 citations · h-index: 10 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses dynamic spatial reasoning for MLLMs, offering a robust baseline that is incremental over existing CoT methods.

The paper tackles the problem of dynamic spatial reasoning in multimodal large language models (MLLMs) by introducing a training-free framework called D2R, which integrates textual chains-of-thought with visual drafts, and it shows consistent performance enhancements across diverse tasks without fine-tuning.

While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. The project is available at https://github.com/Cratileo/D2R.
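The idea of overlaying a draft of the reasoning steps on the input can be illustrated with a toy example. The sketch below is not the D2R implementation: it assumes a character-grid maze in place of an image, and the `MOVES` vocabulary, the `overlay_draft` helper, and the `*` marker are all hypothetical names chosen for illustration. It traces a proposed sequence of textual moves onto the grid and reports whether every step was legal, which is roughly the kind of intermediate visual state a draft could expose to the model.

```python
from typing import List, Tuple

# Hypothetical move vocabulary; the paper's actual action format is not stated here.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def overlay_draft(
    grid: List[str], start: Tuple[int, int], steps: List[str]
) -> Tuple[List[str], bool]:
    """Trace `steps` from `start` and mark visited open cells with '*'.

    Returns the annotated grid (the "draft") and a flag that is True only if
    every step stayed in bounds and avoided '#' walls.
    """
    cells = [list(row) for row in grid]  # copy so the input grid is untouched
    r, c = start
    for step in steps:
        dr, dc = MOVES[step]
        nr, nc = r + dr, c + dc
        in_bounds = 0 <= nr < len(cells) and 0 <= nc < len(cells[0])
        if not in_bounds or cells[nr][nc] == "#":
            # Stop at the first illegal move and return the partial draft.
            return ["".join(row) for row in cells], False
        r, c = nr, nc
        if cells[r][c] == ".":
            cells[r][c] = "*"  # only open cells are overdrawn; 'S' and 'G' stay visible
    return ["".join(row) for row in cells], True


# Toy maze: 'S' start, 'G' goal, '#' wall, '.' open cell.
maze = [
    "S.#",
    ".##",
    "..G",
]
draft, ok = overlay_draft(maze, (0, 0), ["down", "down", "right", "right"])
print("\n".join(draft))
print("legal path:", ok)
```

In an image-based setting the same loop would draw on the picture instead of a character grid; the returned draft would then be passed back to the model alongside the textual reasoning step.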

Code Implementations: 1 repo