CV · Feb 10

SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian

arXiv:2602.09432v12.84 citations · h-index: 5

Originality: Highly original

AI Analysis

SceneReVis addresses spatial reasoning challenges in 3D scene synthesis, with applications such as virtual reality and robotics. It is presented as a novel method rather than an incremental improvement.

The paper tackles spatial hallucinations, such as object collisions, in 3D indoor scene synthesis. It introduces SceneReVis, a self-reflective framework that uses multi-turn RL to diagnose and resolve spatial conflicts, and it reports state-of-the-art performance in high-fidelity generation and goal-oriented optimization.

Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
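
As a rough illustration of what an iterative "diagnose-and-act" loop can look like, the toy Python sketch below repeatedly detects collisions between axis-aligned 2D footprints and nudges one object until no conflict remains. It is not the authors' implementation: the geometric overlap test stands in for the vision-grounded diagnosis, the fixed nudge rule stands in for the learned policy, and all names (`Box`, `diagnose`, `act`, `refine`, `margin`, `max_turns`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned floor footprint (hypothetical stand-in for a scene object)."""
    name: str
    x: float  # min-corner x
    y: float  # min-corner y
    w: float  # extent along x
    d: float  # extent along y


def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Overlap extents along x and y; both positive means the boxes collide."""
    ox = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    oy = min(a.y + a.d, b.y + b.d) - max(a.y, b.y)
    return ox, oy


def diagnose(scene: list[Box]) -> list[tuple[int, int, float, float]]:
    """Diagnose step (assumed geometric check): list all colliding pairs."""
    conflicts = []
    for i in range(len(scene)):
        for j in range(i + 1, len(scene)):
            ox, oy = overlap(scene[i], scene[j])
            if ox > 1e-9 and oy > 1e-9:
                conflicts.append((i, j, ox, oy))
    return conflicts


def act(scene: list[Box], conflict: tuple[int, int, float, float],
        margin: float = 0.05) -> None:
    """Act step (assumed heuristic): push the second box away from the first
    along the axis with the smaller overlap."""
    i, j, ox, oy = conflict
    a, b = scene[i], scene[j]
    if ox <= oy:
        direction = 1.0 if b.x + b.w / 2 >= a.x + a.w / 2 else -1.0
        b.x += direction * (ox + margin)
    else:
        direction = 1.0 if b.y + b.d / 2 >= a.y + a.d / 2 else -1.0
        b.y += direction * (oy + margin)


def refine(scene: list[Box], max_turns: int = 10) -> int:
    """Alternate diagnose and act; return the number of edits applied.
    Conflicts may remain if max_turns is exhausted."""
    for turn in range(max_turns):
        conflicts = diagnose(scene)
        if not conflicts:
            return turn
        act(scene, conflicts[0])
    return max_turns
```

For example, calling `refine([Box("bed", 0, 0, 2, 2), Box("nightstand", 1.8, 0.5, 0.5, 0.5)])` moves the nightstand along x until it no longer overlaps the bed. In the framework described above, the diagnosis would instead come from multi-modal feedback and the action from a trained model; this sketch only shows the control flow of the loop.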
