CV · AI · Apr 17

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

arXiv:2604.16060 · 85.51 citations · h-index: 12
Predicted impact: top 21% in CV (last 90 days) · Originality: Incremental advance
AI Analysis

Identifies a critical limitation of current multimodal reasoning models for spatial intelligence tasks, challenging the use of text-only CoT for such problems.

The paper shows that Chain-of-Thought prompting degrades performance in visual spatial reasoning across 17 models and 13 benchmarks, and reveals that models hallucinate visual details from textual priors even without images.

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT-prompted MLMs suffer from severe shortcut learning and hallucinate visual details from textual priors even when the image is absent. These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.
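The comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code, and it is not the No-Image++ protocol, whose details are not given here; it only shows a plain "image dropped" control next to a direct-versus-CoT accuracy comparison. The `Example` type, the `Model` callable signature, and both prompt suffixes are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence


@dataclass(frozen=True)
class Example:
    """One multiple-choice spatial question (hypothetical schema)."""
    question: str
    image: Optional[bytes]  # raw image bytes; None means no image supplied
    answer: str             # gold option letter, e.g. "A"


# Assumed model interface: (prompt, image) -> predicted option letter.
Model = Callable[[str, Optional[bytes]], str]

# Illustrative prompt suffixes, not the prompts used in the paper.
DIRECT_SUFFIX = "\nAnswer with the option letter only."
COT_SUFFIX = "\nThink step by step, then give the option letter."


def accuracy(model: Model, examples: Sequence[Example], suffix: str,
             drop_image: bool = False) -> float:
    """Fraction of examples answered correctly under one prompting condition."""
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = 0
    for ex in examples:
        image = None if drop_image else ex.image
        prediction = model(ex.question + suffix, image)
        correct += int(prediction.strip().upper() == ex.answer.strip().upper())
    return correct / len(examples)


def compare(model: Model, examples: Sequence[Example]) -> Dict[str, float]:
    """Accuracy with direct prompting, with CoT, and with CoT but no image."""
    direct = accuracy(model, examples, DIRECT_SUFFIX)
    cot = accuracy(model, examples, COT_SUFFIX)
    cot_no_image = accuracy(model, examples, COT_SUFFIX, drop_image=True)
    return {
        "direct": direct,
        "cot": cot,
        "cot_minus_direct": cot - direct,  # negative means CoT hurt accuracy
        "cot_no_image": cot_no_image,      # compare against the chance rate
    }
```

Under these assumptions, a negative `cot_minus_direct` corresponds to the degradation the abstract reports, and a `cot_no_image` score well above chance would suggest the model is answering from textual priors rather than from the image.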

