CV · Jan 12

SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

arXiv:2601.07218v1 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses the generation of realistic 3D indoor scenes from text, with applications in virtual reality or design. It represents an incremental advance over existing autoregressive and diffusion models.

The paper tackles 3D indoor scene synthesis from natural language instructions. It proposes SceneNAT, a masked non-autoregressive Transformer that, on the 3D-FRONT dataset, outperforms prior autoregressive and diffusion methods in semantic compliance and spatial arrangement accuracy at lower computational cost.

We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
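The two-level masking described in the abstract can be illustrated with a minimal sketch. It assumes a scene is stored as an integer array with one row per object and one column per discretized attribute. The names `mask_scene` and `MASK_ID`, the masking ratios, and the independent random sampling are hypothetical choices made for illustration; the abstract does not specify the paper's actual masking schedule or token layout.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel marking a masked token (not from the paper)


def mask_scene(tokens, attr_ratio, inst_ratio, rng):
    """Mask a scene of discretized attribute tokens at two levels.

    tokens: int array of shape (num_objects, num_attrs).
    attr_ratio: probability of hiding each attribute token independently.
    inst_ratio: probability of hiding every attribute of an object.
    Returns (masked_tokens, mask), where mask is True at hidden positions.
    """
    num_objects, num_attrs = tokens.shape
    # Attribute-level masking: hide individual attributes within objects.
    attr_mask = rng.random((num_objects, num_attrs)) < attr_ratio
    # Instance-level masking: hide whole objects (all of their attributes).
    inst_mask = rng.random(num_objects) < inst_ratio
    mask = attr_mask | inst_mask[:, None]
    masked_tokens = np.where(mask, MASK_ID, tokens)
    return masked_tokens, mask


rng = np.random.default_rng(0)
# Toy scene: 4 objects, 5 discretized attributes each, 32 bins per attribute.
scene = rng.integers(0, 32, size=(4, 5))
masked_scene, mask = mask_scene(scene, attr_ratio=0.3, inst_ratio=0.25, rng=rng)
```

In a masked-modeling setup of this kind, a model would be trained to predict the original tokens at the positions where `mask` is True. Attribute-level masks encourage learning dependencies within an object, while instance-level masks force whole objects to be inferred from the rest of the scene.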

