RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
RoomPilot addresses the need for controllable scene generation in game development, architectural visualization, and embodied AI training. It is a meaningful, though incremental, improvement over prior approaches.
The authors address controllable, interactive indoor scene generation with RoomPilot, a framework that parses text or CAD floor plans into a domain-specific language for structured scene generation. They report superior physical consistency and visual fidelity compared to existing methods.
Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multimodal inputs (textual descriptions or CAD floor plans) into an Indoor Domain-Specific Language (IDSL) for structured indoor scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments validate its strong multimodal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
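To make the idea of a shared semantic representation concrete, here is a minimal sketch in which two toy front ends, one for text and one for an already-vectorized floor plan, produce the same intermediate structure, which is then serialized to a DSL-like text. Everything in it is an assumption made for illustration: the abstract does not specify the IDSL syntax, so the `Asset` and `Room` types, the `ASSET_CATALOG` entries, the parsing functions, and the `room ... place ... affords ...` format are hypothetical and not RoomPilot's actual design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asset:
    """An object type; `interactions` stands in for interaction annotations."""
    category: str
    interactions: tuple = ()


@dataclass
class Room:
    """Hypothetical intermediate representation shared by both front ends."""
    name: str
    width_m: float
    depth_m: float
    assets: list = field(default_factory=list)


# Hypothetical catalog of interaction-annotated assets.
ASSET_CATALOG = {
    "door": Asset("door", ("open", "close")),
    "lamp": Asset("lamp", ("toggle",)),
    "sofa": Asset("sofa", ("sit",)),
}


def room_from_text(name, width_m, depth_m, description):
    """Toy text front end: keyword lookup, not a real language parser."""
    words = description.lower().replace(",", " ").split()
    assets = [ASSET_CATALOG[w] for w in words if w in ASSET_CATALOG]
    return Room(name, width_m, depth_m, assets)


def room_from_floor_plan(plan):
    """Toy floor-plan front end: reads an already-vectorized plan dict."""
    x0, y0, x1, y1 = plan["bbox_m"]
    assets = [ASSET_CATALOG[s] for s in plan["symbols"] if s in ASSET_CATALOG]
    return Room(plan["label"], x1 - x0, y1 - y0, assets)


def to_dsl(room):
    """Serialize a Room to a made-up, line-oriented DSL text."""
    lines = [f"room {room.name} size {room.width_m:g}x{room.depth_m:g}"]
    for asset in room.assets:
        affordances = ",".join(asset.interactions)
        lines.append(f"  place {asset.category} affords {affordances}")
    return "\n".join(lines)


from_text = room_from_text("living", 4.0, 3.0, "a sofa, a lamp")
from_plan = room_from_floor_plan(
    {"label": "living", "bbox_m": (0.0, 0.0, 4.0, 3.0), "symbols": ["sofa", "lamp"]}
)
print(to_dsl(from_text))
```

In this sketch both inputs reduce to the same `Room` value, so any downstream generator only needs to understand one format. Carrying the interaction annotations in the representation itself is what would let a generated scene keep its object behaviors regardless of which modality it came from.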