CV LG May 7, 2025

Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Divyansh Srivastava, Xiang Zhang, He Wen, Chenru Wen, Zhuowen Tu

arXiv:2505.04718v1, 14.46 citations, h-index: 8, Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for open-vocabulary, controllable scene layout generation in image synthesis and editing applications, though it appears incremental because it builds on existing diffusion and Transformer methods.

The paper tackles natural scene layout generation from text prompts by proposing Lay-Your-Scene, a pipeline that combines lightweight open-source language models with a novel diffusion Transformer architecture. It reports state-of-the-art performance on spatial and numerical reasoning benchmarks.

We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
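To make the idea of a scene layout concrete, here is a minimal sketch of how a generated layout could be represented and mapped onto canvases of different aspect ratios. It is an illustration under stated assumptions, not LayouSyn's actual code or data format: the `Box` class, the normalized centre-format coordinates, and the helpers `to_pixels` and `left_of` are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Box:
    """One layout element (hypothetical format, not taken from the paper)."""
    label: str  # free-text object description, e.g. "brown dog"
    cx: float   # normalized centre x in [0, 1]
    cy: float   # normalized centre y in [0, 1]
    w: float    # normalized width in [0, 1]
    h: float    # normalized height in [0, 1]


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a normalized box to pixel corners (x0, y0, x1, y1), clipped to the canvas."""
    x0 = max(0.0, (box.cx - box.w / 2) * width)
    y0 = max(0.0, (box.cy - box.h / 2) * height)
    x1 = min(float(width), (box.cx + box.w / 2) * width)
    y1 = min(float(height), (box.cy + box.h / 2) * height)
    return round(x0), round(y0), round(x1), round(y1)


def left_of(a: Box, b: Box) -> bool:
    """Toy spatial check: is the centre of `a` to the left of the centre of `b`?"""
    return a.cx < b.cx


# Assumed example layout for the prompt "a dog to the left of a cat".
dog = Box("dog", cx=0.25, cy=0.5, w=0.2, h=0.4)
cat = Box("cat", cx=0.75, cy=0.5, w=0.2, h=0.4)
```

Because the boxes are stored in normalized coordinates, the same layout can be rendered on a 640x480 or a 480x640 canvas by changing only the arguments to `to_pixels`. The `left_of` helper is a toy stand-in for a spatial-relation check and is not the metric used by the benchmarks mentioned in the abstract.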
