CV · Nov 4, 2024

Training-free Regional Prompting for Diffusion Transformers

arXiv:2411.02395v1 · 28 citations · h-index: 17 · Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific limitation in text-to-image generation for users of diffusion transformers, representing an incremental improvement by adapting existing regional prompting techniques to a new architecture.

Diffusion transformers struggle with long, complex text prompts that involve multiple objects and spatial relationships. The authors address this with a training-free regional prompting method for FLUX.1 that manipulates attention to enable fine-grained compositional image generation.

Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the text prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which equips DiT with fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.
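To make "attention manipulation" more concrete, the following is a minimal sketch of how a regional attention mask over a joint text-and-image token sequence might be built. It is not the authors' implementation: the function name `build_regional_mask`, the token layout (all regional text tokens first, then image patches in row-major order), the box format, and the choice to leave image-to-image attention unrestricted are all assumptions made for illustration.

```python
def build_regional_mask(height, width, regions, tokens_per_prompt):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    Hypothetical layout (an assumption, not taken from the paper):
      - text tokens come first, `tokens_per_prompt` per regional prompt
      - image tokens follow, one per latent patch, in row-major order
    `regions` is a list of (top, left, bottom, right) boxes on the latent grid,
    with bottom and right exclusive; region k is paired with regional prompt k.
    """
    n_text = len(regions) * tokens_per_prompt
    n_total = n_text + height * width
    mask = [[False] * n_total for _ in range(n_total)]

    for k, (top, left, bottom, right) in enumerate(regions):
        text_ids = range(k * tokens_per_prompt, (k + 1) * tokens_per_prompt)
        image_ids = [
            n_text + row * width + col
            for row in range(top, bottom)
            for col in range(left, right)
        ]
        for t in text_ids:
            # Text tokens of one regional prompt see only each other.
            for t2 in text_ids:
                mask[t][t2] = True
            # A regional prompt and the patches inside its box see each other.
            for p in image_ids:
                mask[t][p] = True
                mask[p][t] = True

    # Simplifying assumption: image patches attend to all image patches,
    # so every row has at least one allowed key.
    for p in range(n_text, n_total):
        for q in range(n_text, n_total):
            mask[p][q] = True
    return mask


# Example: a 2x4 latent grid split into a left and a right region.
mask = build_regional_mask(
    height=2, width=4,
    regions=[(0, 0, 2, 2), (0, 2, 2, 4)],
    tokens_per_prompt=3,
)
```

In a real pipeline a mask like this would be converted to the form the attention kernel expects (for example, an additive bias with large negative values at disallowed positions) and applied inside the transformer blocks without any retraining. How the released code structures its mask and which blocks it is applied to should be checked against the repository linked above.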

Code Implementations: 1 repo