CL · CV · Mar 6

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

arXiv:2603.06024v1 · h-index: 2
Predicted impact: top 51% in CL · last 90 days · Originality: Incremental advance
AI Analysis

The paper addresses a specific bottleneck in multi-view reasoning for vision-language models and reports an incremental improvement in accuracy.

The paper tackles multi-view spatial reasoning in vision-language models, which often underutilize cross-view relations. It introduces ViewFusion, a two-stage framework that improves accuracy by 5.3% over Qwen3-VL-4B-Instruct on MMSI-Bench.

Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
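
To make the training signal described in the abstract more concrete, here is a minimal sketch of how a GRPO-style reward could score both answer correctness and adherence to a two-stage output. It is an illustration under stated assumptions, not the authors' implementation: the `<spatial_thinking>` and `<answer>` tags, the `reward` function, and the `format_bonus` value are all hypothetical. The one part taken from GRPO itself is the group-relative advantage, which normalizes each sampled completion's reward by the mean and standard deviation of its group; the use of the population standard deviation here is also an assumption, since implementations differ.

```python
import re
from statistics import mean, pstdev

# Hypothetical output layout: a spatial pre-thinking block followed by an answer.
# The tag names are assumptions for illustration; the abstract does not specify them.
STAGE_PATTERN = re.compile(
    r"^\s*<spatial_thinking>(.+?)</spatial_thinking>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def reward(completion: str, gold: str, format_bonus: float = 0.2) -> float:
    """Toy reward: a small bonus for keeping the two-stage layout,
    plus 1.0 if the extracted answer matches the gold label."""
    match = STAGE_PATTERN.match(completion)
    if match is None:
        return 0.0  # malformed output earns nothing, even if it contains the answer
    answer = match.group(2).strip()
    return format_bonus + (1.0 if answer == gold else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each reward is normalized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled completions for the same multi-view question.
completions = [
    "<spatial_thinking>View 2 is rotated 90 degrees right of view 1.</spatial_thinking><answer>B</answer>",
    "<spatial_thinking>Both views share one camera.</spatial_thinking><answer>C</answer>",
    "B",
]
rewards = [reward(c, gold="B") for c in completions]
advantages = group_relative_advantages(rewards)
```

In this toy group, the well-formed correct completion receives a positive advantage, while the wrong answer and the completion that skips the pre-thinking stage both fall below the group mean. This is one way a reward could favor correct answers while keeping the two-stage behavior stable.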

