CV · Apr 8

Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

arXiv:2604.06725 · 93.7
Predicted impact: top 11% in CV · last 90 days
Originality: Incremental advance
AI Analysis

The paper addresses the difficulty MLLMs have with complex 3D spatial reasoning, which matters for applications that require multi-perspective understanding. The contribution appears incremental, since it builds on existing 3D reconstruction and viewpoint synthesis techniques.

Multimodal Large Language Models struggle with 3D spatial reasoning. The paper proposes a training-free framework that uses 3D reconstruction and novel view synthesis to improve spatial comprehension, and reports better performance than specialized spatial models and general-purpose MLLMs on benchmarks such as 3DSRBench and Rel3D.

Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
