CV | Mar 15, 2025

STAY Diffusion: Styled Layout Diffusion Model for Diverse Layout-to-Image Generation

arXiv:2503.12213v1 | 5 citations | h-index: 2 | WACV
Originality: Incremental advance
AI Analysis

This work addresses the problem of generating controlled complex scenes from layouts for downstream applications, offering incremental improvements in guidance and control.

The paper tackles layout-to-image synthesis by proposing STAY Diffusion, a diffusion-based model that generates photo-realistic images with fine-grained control over stylized objects, achieving state-of-the-art results in diversity, accuracy, and controllability.

In layout-to-image (L2I) synthesis, controlled complex scenes are generated from coarse information like bounding boxes. This task appeals to many downstream applications because the input layouts offer strong guidance to the generation process while remaining easily reconfigurable by humans. In this paper, we propose STyled LAYout Diffusion (STAY Diffusion), a diffusion-based model that produces photo-realistic images and provides fine-grained control of stylized objects in scenes. Our approach learns a global condition for each layout and a self-supervised semantic map for weight modulation using a novel Edge-Aware Normalization (EA Norm). A new Styled-Mask Attention (SM Attention) is also introduced to cross-condition the global condition and image features, capturing the relationships between objects. These measures provide consistent guidance throughout the model, enabling more accurate and controllable image generation. Extensive benchmarking demonstrates that STAY Diffusion produces high-quality images while surpassing previous state-of-the-art methods in generation diversity, accuracy, and controllability.
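To make the notion of a coarse, easily reconfigurable layout concrete, the following is a minimal sketch of how a set of bounding boxes might be rasterized into a per-pixel label map before being passed to a generator. It is not taken from the paper: the box format (class id plus normalized corner coordinates), the function name rasterize_layout, and the rule that later boxes overwrite earlier ones are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout format (an assumption, not the paper's):
# (class_id, x0, y0, x1, y1) with coordinates normalized to [0, 1].
Box = Tuple[int, float, float, float, float]


def rasterize_layout(
    boxes: List[Box], height: int, width: int, background: int = 0
) -> List[List[int]]:
    """Paint each box onto an integer label map.

    Boxes are painted in order, so a later box overwrites an earlier one
    wherever they overlap (an illustrative choice).
    """
    label_map = [[background] * width for _ in range(height)]
    for class_id, x0, y0, x1, y1 in boxes:
        # Convert normalized coordinates to pixel indices, clamped to the grid.
        c0 = max(0, min(width, int(round(x0 * width))))
        c1 = max(0, min(width, int(round(x1 * width))))
        r0 = max(0, min(height, int(round(y0 * height))))
        r1 = max(0, min(height, int(round(y1 * height))))
        for r in range(r0, r1):
            for c in range(c0, c1):
                label_map[r][c] = class_id
    return label_map


# Two overlapping objects on a 4x4 grid.
layout = [(1, 0.0, 0.0, 0.5, 0.5), (2, 0.25, 0.25, 1.0, 1.0)]
label_map = rasterize_layout(layout, height=4, width=4)
for row in label_map:
    print(row)
```

Editing the layout list (moving, resizing, or relabeling a box) and rasterizing again yields a new label map, which illustrates why this kind of input is easy for a person to reconfigure.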

