CV · Jun 3, 2025

ReSpace: Text-Driven 3D Indoor Scene Synthesis and Editing with Preference Alignment

arXiv:2506.02459v3 · 3 citations · h-index: 4
Originality: Incremental advance
AI Analysis

ReSpace addresses the need for more flexible and semantically rich 3D scene generation and editing tools in computer graphics. It is an incremental improvement over existing LLM-based methods, adding editing functionality and better handling of spatial constraints.

The paper tackles the problem of 3D indoor scene synthesis and editing by introducing ReSpace, a generative framework that uses autoregressive language models to enable text-driven creation and modification, achieving state-of-the-art results on object addition and superior human-perceived quality in full scene synthesis.

Scene synthesis and editing has emerged as a promising direction in computer graphics. Current trained approaches for 3D indoor scenes either oversimplify object semantics through one-hot class encodings (e.g., 'chair' or 'table'), require masked diffusion for editing, ignore room boundaries, or rely on floor plan renderings that fail to capture complex layouts. LLM-based methods enable richer semantics via natural language (e.g., 'modern studio with light wood furniture'), but lack editing functionality, are limited to rectangular layouts, or rely on weak spatial reasoning from implicit world models. We introduce ReSpace, a generative framework for text-driven 3D indoor scene synthesis and editing using autoregressive language models. Our approach features a compact structured scene representation with explicit room boundaries that enables asset-agnostic deployment and frames scene editing as a next-token prediction task. We leverage a dual-stage training approach combining supervised fine-tuning and preference alignment, enabling a specially trained language model for object addition that accounts for user instructions, spatial geometry, object semantics, and scene-level composition. For scene editing, we employ a zero-shot LLM to handle object removal and prompts for addition. We further introduce a voxelization-based evaluation capturing fine-grained geometry beyond 3D bounding boxes. Experimental results surpass state-of-the-art on addition and achieve superior human-perceived quality on full scene synthesis.
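
To illustrate why a voxelization-based evaluation can capture geometry that 3D bounding boxes miss, here is a minimal sketch in plain Python. It is not the paper's metric: the function names (`box_points`, `voxelize`, `voxel_overlap`, `aabb_overlap`), the 0.1 voxel size, and the L-shaped sofa and table are all assumptions made up for this example. Objects are approximated by dense point samples, and two objects are treated as intersecting only if they occupy a common voxel.

```python
import math
from itertools import product

def box_points(lo, hi, step=0.05):
    """Sample a dense grid of points filling an axis-aligned box (hypothetical helper)."""
    axes = []
    for l, h in zip(lo, hi):
        n = round((h - l) / step)
        # Place samples at cell centres so none lies on a voxel boundary.
        axes.append([l + (i + 0.5) * step for i in range(n)])
    return list(product(*axes))

def voxelize(points, voxel_size=0.1):
    """Map points to the set of integer voxel indices they occupy."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}

def voxel_overlap(a, b, voxel_size=0.1):
    """Number of voxels occupied by both point sets."""
    return len(voxelize(a, voxel_size) & voxelize(b, voxel_size))

def aabb_overlap(a, b):
    """True if the axis-aligned bounding boxes of the two point sets intersect."""
    for axis in range(3):
        lo = max(min(p[axis] for p in a), min(p[axis] for p in b))
        hi = min(max(p[axis] for p in a), max(p[axis] for p in b))
        if hi <= lo:
            return False
    return True

# An L-shaped sofa built from two boxes, and a small table placed in its notch.
sofa = box_points((0, 0, 0), (2, 1, 1)) + box_points((0, 1, 0), (1, 2, 1))
table = box_points((1.2, 1.2, 0), (1.8, 1.8, 0.5))

print(aabb_overlap(sofa, table))   # bounding boxes intersect
print(voxel_overlap(sofa, table))  # but no voxel is shared
```

In this toy layout the bounding-box test flags a collision because the sofa's box covers the empty notch, while the voxel test reports no shared cells. A finer voxel size tightens the approximation at the cost of more cells to compare.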
