SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis
This work addresses the problem of generating realistic 3D indoor scenes from text, with applications in virtual reality and design, and represents an incremental advance over existing autoregressive and diffusion models.
The paper tackles 3D indoor scene synthesis from natural language instructions by proposing SceneNAT, a masked non-autoregressive Transformer. On the 3D-FRONT dataset, SceneNAT achieves better semantic compliance and spatial arrangement accuracy than prior methods at lower computational cost.
We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
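The two-level masking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The attribute layout (category, position bins, size bin, angle bin), the `MASK` id, the masking rates, and the function name `mask_scene` are all assumptions made for illustration, since the abstract does not specify them. Each object is a row of discrete attribute tokens: instance-level masking hides a whole row, while attribute-level masking hides individual entries.

```python
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def mask_scene(tokens, attr_rate, inst_rate, rng):
    """Return a masked copy of `tokens` and the masked (object, attribute) positions.

    tokens:    list of objects, each a list of discrete attribute ids
    attr_rate: probability of hiding a single attribute (attribute level)
    inst_rate: probability of hiding an entire object (instance level)
    rng:       a random.Random instance, for reproducibility
    """
    masked = [row[:] for row in tokens]  # copy so the input stays untouched
    positions = []
    for i, row in enumerate(tokens):
        # Instance level: decide once per object whether to hide all of it.
        hide_all = rng.random() < inst_rate
        for j in range(len(row)):
            # Attribute level: otherwise hide each attribute independently.
            if hide_all or rng.random() < attr_rate:
                masked[i][j] = MASK
                positions.append((i, j))
    return masked, positions


# Toy scene with an assumed layout per object:
# [category, x_bin, y_bin, z_bin, size_bin, angle_bin]
scene = [
    [3, 10, 4, 12, 7, 0],
    [8, 2, 4, 30, 5, 16],
    [1, 20, 4, 9, 2, 8],
]
masked, positions = mask_scene(scene, attr_rate=0.3, inst_rate=0.2,
                               rng=random.Random(0))
```

In a masked-modeling setup of this kind, the model would be trained to predict the original ids at `positions` given `masked` and the text instruction. Hiding whole objects pushes the model to use inter-object context, while hiding single attributes pushes it to use the remaining attributes of the same object.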