CV · May 25

AI-T2I: Aggregating-and-Isolating Cross-Attention to Diffusion Models for Text-to-Image Synthesis

Shipeng Cao, Biao Qian, Haipeng Liu, Yang Wang, Meng Wang

arXiv:2605.25763 · 67.4

Predicted impact: top 47% in CV · last 90 days · Originality: Incremental advance

AI Analysis

Improves text-to-image alignment for diffusion model users, addressing a known bottleneck in cross-attention mechanisms.

AI-T2I addresses the scattering of intra-subject-token activations in diffusion models for text-to-image synthesis, achieving state-of-the-art alignment and generalizing to controllable layout generation and personalized generation.

Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on the overlap of inter-subject-token activations (i.e., cross-attention scores) across different subjects, overlooking the scattering of intra-subject-token activations for the same subject. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss that identifies and consolidates the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Building on that, an isolation loss is further introduced to push the inter-token activations apart, thus achieving precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over state-of-the-art methods for text-to-image synthesis. Furthermore, AI-T2I generalizes well to other tasks, e.g., controllable layout generation and personalized generation.
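The abstract does not give the form of the two losses, so the following is only a rough sketch of the general idea, not the paper's formulation. It assumes that scattering can be measured as the attention-weighted spread of a token's map around its centre of mass, and that overlap can be measured as the mass shared between two normalized maps. The function names `aggregation_loss` and `isolation_loss` and both formulas are hypothetical stand-ins chosen for illustration. Plain Python lists are used to keep the example self-contained; a real guidance loss would need differentiable tensor operations.

```python
def normalize(attn):
    """Scale a 2D attention map (list of rows) so its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    return [[v / total for v in row] for row in attn]


def aggregation_loss(attn):
    """Hypothetical scattering measure (an assumption, not the paper's loss):
    attention-weighted squared distance from the map's centre of mass.
    Compact activations give small values, scattered ones large values."""
    p = normalize(attn)
    cy = sum(y * v for y, row in enumerate(p) for v in row)
    cx = sum(x * v for row in p for x, v in enumerate(row))
    return sum(
        v * ((y - cy) ** 2 + (x - cx) ** 2)
        for y, row in enumerate(p)
        for x, v in enumerate(row)
    )


def isolation_loss(attn_a, attn_b):
    """Hypothetical overlap measure (an assumption, not the paper's loss):
    mass shared by two normalized maps, i.e. the sum of element-wise minima.
    The value lies in [0, 1]; 0 means the two maps are disjoint."""
    pa, pb = normalize(attn_a), normalize(attn_b)
    return sum(min(a, b) for ra, rb in zip(pa, pb) for a, b in zip(ra, rb))


# Toy 4x4 maps for a single subject token.
compact = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
scattered = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]

# Toy maps for two different subject tokens.
left = [[1, 1, 0, 0] for _ in range(4)]
right = [[0, 0, 1, 1] for _ in range(4)]

print(aggregation_loss(compact), aggregation_loss(scattered))
print(isolation_loss(left, right), isolation_loss(left, left))
```

Under these assumed definitions, the scattered map scores higher than the compact one, and the two disjoint maps have zero overlap while identical maps overlap fully. Minimizing the first quantity per subject token and the second across subject tokens mirrors the aggregate-then-isolate ordering the abstract describes.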
