CV | AI | Mar 22, 2025

Progressive Prompt Detailing for Improved Alignment in Text-to-Image Generative Models

arXiv:2503.17794v46 citations | h-index: 25
Originality: Highly original
AI Analysis

This paper addresses text-to-image alignment for users who need to generate detailed scenes. It is an incremental improvement that applies a novel method to a known bottleneck.

The paper tackles the problem of text-to-image models struggling with long, complex prompts by proposing SCoPE, a training-free method that progressively refines prompts from coarse to fine details, achieving an average improvement of over +8 in VQA scores on 83% of prompts from the GenAI-Bench dataset compared to Stable Diffusion baselines.

Text-to-image generative models often struggle with long prompts detailing complex scenes, diverse objects with distinct visual characteristics and spatial relationships. In this work, we propose SCoPE (Scheduled interpolation of Coarse-to-fine Prompt Embeddings), a training-free method to improve text-to-image alignment by progressively refining the input prompt in a coarse-to-fine-grained manner. Given a detailed input prompt, we first decompose it into multiple sub-prompts which evolve from describing broad scene layout to highly intricate details. During inference, we interpolate between these sub-prompts and thus progressively introduce finer-grained details into the generated image. Our training-free plug-and-play approach significantly enhances prompt alignment, achieving an average improvement of more than +8 in Visual Question Answering (VQA) scores over the Stable Diffusion baselines on 83% of the prompts from the GenAI-Bench dataset.
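The coarse-to-fine interpolation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the interpolation schedule, so a piecewise-linear schedule over the denoising steps is assumed here, and the function names (`lerp`, `scheduled_embedding`) and the toy 2-D "embeddings" are hypothetical. In a real pipeline the vectors would be text-encoder outputs for each sub-prompt, and the blended embedding would condition the denoiser at each step.

```python
from typing import List, Sequence

Vector = List[float]


def lerp(a: Sequence[float], b: Sequence[float], w: float) -> Vector:
    """Linear interpolation between two equal-length vectors (w=0 -> a, w=1 -> b)."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def scheduled_embedding(
    embeddings: Sequence[Sequence[float]], step: int, num_steps: int
) -> Vector:
    """Blend sub-prompt embeddings, ordered coarse to fine, for one denoising step.

    ASSUMPTION: a piecewise-linear schedule. Step 0 uses the coarsest embedding,
    the last step uses the finest, and steps in between interpolate linearly
    between the two neighbouring sub-prompt embeddings.
    """
    if not embeddings:
        raise ValueError("need at least one sub-prompt embedding")
    if num_steps < 1 or not 0 <= step < num_steps:
        raise ValueError("step must satisfy 0 <= step < num_steps")
    if len(embeddings) == 1 or num_steps == 1:
        # Nothing to interpolate: fall back to the most detailed embedding.
        return list(embeddings[-1])

    # Map the step index onto the [0, K-1] axis of sub-prompt indices.
    position = step / (num_steps - 1) * (len(embeddings) - 1)
    # Clamp so that the final step interpolates fully into the last embedding.
    lower = min(int(position), len(embeddings) - 2)
    return lerp(embeddings[lower], embeddings[lower + 1], position - lower)


if __name__ == "__main__":
    # Toy stand-ins for the embeddings of three sub-prompts (coarse -> fine).
    toy = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
    for t in range(5):
        print(t, scheduled_embedding(toy, t, num_steps=5))
```

Because the blend happens purely at inference time on the conditioning vectors, a scheme like this needs no retraining, which is consistent with the training-free, plug-and-play framing in the abstract.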

Code Implementations: 1 repo