CV | LG | Sep 30, 2025

Text-to-Scene with Large Reasoning Models

arXiv:2509.26091v23 citations | h-index: 24
Originality: Highly original
AI Analysis

This paper addresses the challenge of creating detailed 3D environments from text for users in fields such as design and gaming. It represents a strong, specific gain rather than a foundational advance.

The paper tackles text-to-scene generation, where existing methods struggle with complex geometries and complex instructions. It introduces Reason-3D, which uses large reasoning models for object retrieval and placement and significantly outperforms previous methods on human-rated metrics such as visual fidelity and constraint adherence.

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
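
The abstract does not say how the collision-aware refinement step works, so the following is only a minimal sketch of one generic way to separate overlapping objects after initial placement. It is not the Reason-3D implementation. The `Box` class, the `overlap` and `resolve_collisions` functions, and the 2D floor-plane simplification are all assumptions made for illustration.

```python
from dataclasses import dataclass

EPS = 1e-9  # tolerance so floating-point residue is not treated as a collision


@dataclass
class Box:
    """Hypothetical axis-aligned footprint of an object on the floor plane."""
    name: str
    x: float  # center along x
    y: float  # center along y
    w: float  # extent along x
    d: float  # extent along y


def overlap(a: Box, b: Box) -> tuple[float, float]:
    """Penetration depth along x and y; the boxes collide only if both exceed EPS."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox, oy


def resolve_collisions(boxes: list[Box], max_iters: int = 100) -> bool:
    """Push colliding pairs apart along the axis of least penetration.

    Mutates the boxes in place. Returns True once a full pass finds no
    collisions, or False if max_iters passes were not enough.
    """
    for _ in range(max_iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                ox, oy = overlap(a, b)
                if ox <= EPS or oy <= EPS:
                    continue  # separated or merely touching
                moved = True
                if ox <= oy:
                    # Split the correction evenly between the two objects.
                    sign = 1.0 if b.x >= a.x else -1.0
                    a.x -= sign * ox / 2
                    b.x += sign * ox / 2
                else:
                    sign = 1.0 if b.y >= a.y else -1.0
                    a.y -= sign * oy / 2
                    b.y += sign * oy / 2
        if not moved:
            return True
    return False


if __name__ == "__main__":
    scene = [Box("sofa", 0.0, 0.0, 2.0, 1.0), Box("table", 0.5, 0.0, 1.0, 1.0)]
    print(resolve_collisions(scene), scene)
```

A pairwise push of this kind ignores room walls and the layout constraints stated in the prompt, both of which a real system would have to respect. In the paper's pipeline, that reasoning is attributed to the LRM itself.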

