CV · Feb 6

Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers

arXiv:2602.06886v2 · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in text-to-image models for users who need better prompt adherence, but it is an incremental advance because it builds on existing MMDiT architectures.

The paper tackles prompt forgetting in Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation, where prompt semantics degrade with depth, and introduces a training-free prompt reinjection method that improves instruction-following and quality metrics across benchmarks.

Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch is progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
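
The abstract does not spell out how the reinjection is computed, so the following is only a toy sketch of the general idea: cache the text-branch state after an early layer and add it back before selected later layers. The residual-add rule, the `alpha` weight, the `run_with_reinjection` helper, and the `decay` stand-in layer are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_with_reinjection(
    layers: Sequence[Layer],
    text_state: Vector,
    source_layer: int,
    target_layers: Sequence[int],
    alpha: float = 0.5,
) -> Vector:
    """Run a toy text branch, adding a cached early state back into later layers."""
    cached: Optional[Vector] = None
    targets = set(target_layers)
    for index, layer in enumerate(layers):
        if index in targets and cached is not None:
            # Assumed reinjection rule: weighted residual add of the cached state.
            text_state = [h + alpha * c for h, c in zip(text_state, cached)]
        text_state = layer(text_state)
        if index == source_layer:
            # Keep a copy of the early-layer prompt representation.
            cached = list(text_state)
    return text_state


def decay(state: Vector) -> Vector:
    """Toy layer that halves every feature, standing in for 'forgetting'."""
    return [0.5 * x for x in state]


layers = [decay] * 4
prompt = [8.0, 4.0]

# No target layers: the prompt signal simply shrinks with depth.
baseline = run_with_reinjection(layers, prompt, source_layer=0, target_layers=[])

# Reinject the layer-0 output before layers 2 and 3.
reinjected = run_with_reinjection(layers, prompt, source_layer=0, target_layers=[2, 3])

print(baseline, reinjected)
```

In this toy setup the reinjected run ends with a larger prompt signal than the baseline, which mirrors the motivation stated in the abstract. In a real MMDiT the same pattern would have to operate on the text-token hidden states inside each transformer block, with no retraining of the weights.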
