CV, GR, LG | Aug 24, 2023

Dense Text-to-Image Generation with Attention Modulation

arXiv:2308.12964v1 | 207 citations | h-index: 23
Originality: Incremental advance
AI Analysis

This work addresses a specific challenge in image generation, namely handling detailed scene descriptions, but the advance is incremental: it builds on existing pre-trained models without new training.

The paper addresses the difficulty that text-to-image diffusion models have with dense captions. It proposes DenseDiffusion, a training-free method that uses attention modulation to place objects according to layout guidance. The method improves both automatic and human evaluation scores and produces visual results comparable in quality to those of models trained with layout conditions.

Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions.
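The idea of modulating attention with a layout can be illustrated with a small sketch. What follows is a simplified toy version under stated assumptions, not the paper's exact formulation: it adds a constant bias to the pre-softmax cross-attention scores wherever a pixel lies inside the region assigned to a text token, and subtracts the same bias elsewhere. The function names, the `strength` parameter, and the 4x4 toy layout are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def modulated_cross_attention(q, k, region_mask, strength=1.0):
    """Toy layout-guided cross-attention (illustrative assumption, not the paper's rule).

    q:           (num_pixels, d) image-side queries
    k:           (num_tokens, d) text-side keys
    region_mask: (num_pixels, num_tokens) boolean; True where the pixel lies
                 in the region the layout assigns to that token
    strength:    hypothetical scalar controlling how strongly the layout is enforced
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Raise scores for (pixel, token) pairs inside the token's region,
    # lower them for pairs outside it.
    bias = np.where(region_mask, strength, -strength)
    return softmax(scores + bias, axis=-1)


def demo():
    rng = np.random.default_rng(0)
    num_pixels, num_tokens, d = 16, 2, 8  # a flattened 4x4 feature map, two tokens
    q = rng.normal(size=(num_pixels, d))
    k = rng.normal(size=(num_tokens, d))
    # Token 0 owns the left half of the grid, token 1 the right half.
    cols = np.arange(num_pixels) % 4
    region_mask = np.stack([cols < 2, cols >= 2], axis=1)
    plain = modulated_cross_attention(q, k, region_mask, strength=0.0)
    guided = modulated_cross_attention(q, k, region_mask, strength=3.0)
    return plain, guided, region_mask


if __name__ == "__main__":
    plain, guided, region_mask = demo()
    print("mean in-region attention, unmodulated:", plain[region_mask].mean())
    print("mean in-region attention, modulated:  ", guided[region_mask].mean())
```

Setting `strength=0.0` recovers ordinary attention, so the two calls in `demo` show the effect of the bias in isolation. In a real diffusion model the same kind of adjustment would be applied inside the pre-trained network's attention layers at sampling time, which is why no fine-tuning is needed.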

Code Implementations: 1 repo
