CV · Jan 13, 2025

Enhancing Image Generation Fidelity via Progressive Prompts

arXiv:2501.07070v1 · 9 citations · h-index: 7 · ICASSP
Originality: Incremental advance
AI Analysis

This work addresses regional prompt-following for users of image generation models, but the advance is incremental because it builds on existing DiT methods.

The paper tackles the problem of limited regional prompt control in diffusion transformer (DiT) image generation by proposing a coarse-to-fine pipeline that uses LLMs to generate high- and low-level descriptions and injects them via regional cross-attention control, enhancing controllability and improving generated image performance.

The diffusion transformer (DiT) architecture has attracted significant attention in image generation, achieving better fidelity, performance, and diversity. However, most existing DiT-based image generation methods focus on global-aware synthesis, and regional prompt control has been less explored. In this paper, we propose a coarse-to-fine generation pipeline for regional prompt-following generation. Specifically, we first utilize the powerful large language model (LLM) to generate both high-level descriptions of the image (such as content, topic, and objects) and low-level descriptions (such as details and style). Then, we explore the influence of cross-attention layers at different depths. We find that deeper layers are always responsible for high-level content control, while shallow layers handle low-level content control. Various prompts are injected into the proposed regional cross-attention control for coarse-to-fine generation. By using the proposed pipeline, we enhance the controllability of DiT-based image generation. Extensive quantitative and qualitative results show that our pipeline can improve the performance of the generated images.
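
The two ideas in the abstract, routing prompts by layer depth and restricting cross-attention to regions, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the `split` fraction that divides shallow from deep layers, and the integer region labels are all hypothetical choices made for illustration. The routing follows the abstract's stated finding (shallow layers get low-level prompts, deeper layers get high-level prompts).

```python
import math


def prompt_level_for_layer(layer_idx, num_layers, split=0.5):
    """Return "low" for shallow layers and "high" for deeper ones.

    `split` is a hypothetical knob (fraction of layers treated as shallow);
    the abstract does not state where the boundary lies.
    """
    if not 0 <= layer_idx < num_layers:
        raise ValueError("layer_idx out of range")
    return "low" if layer_idx < split * num_layers else "high"


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regional_cross_attention(queries, query_regions, keys, values, key_regions):
    """Cross-attention in which each image token only sees its region's text.

    queries:       list of image-token vectors
    query_regions: region label for each image token (assumed given)
    keys, values:  text-token vectors for all regional prompts, concatenated
    key_regions:   region label for each text token
    """
    dim = len(values[0])
    outputs = []
    for query, region in zip(queries, query_regions):
        # Keep only the text tokens that belong to this token's region.
        allowed = [j for j, r in enumerate(key_regions) if r == region]
        if not allowed:
            # No prompt for this region: contribute nothing.
            outputs.append([0.0] * dim)
            continue
        scale = math.sqrt(len(query))
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / scale for j in allowed
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * values[j][d] for w, j in zip(weights, allowed))
             for d in range(dim)]
        )
    return outputs
```

In a real DiT, the same masking would be applied as an attention mask inside each cross-attention layer, with `prompt_level_for_layer` deciding which set of text embeddings supplies the keys and values at that depth.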

Code Implementations: 1 repo