CV · Mar 22, 2025

Towards Transformer-Based Aligned Generation with Self-Coherence Guidance

arXiv:2503.17675v1 · 14 citations · h-index: 10 · Has Code · CVPR
Originality: Incremental advance
AI Analysis

This work addresses alignment issues in text-to-image generation for users needing precise control over complex prompts, representing an incremental improvement over existing methods.

The paper tackles the problem of poor semantic alignment in Transformer-based text-guided diffusion models by introducing a training-free method that optimizes cross-attention maps during generation, achieving superior performance on challenging benchmarks for attribute and style binding.

We introduce a novel, training-free approach for enhancing alignment in Transformer-based Text-Guided Diffusion Models (TGDMs). Existing TGDMs often struggle to generate semantically aligned images, particularly when dealing with complex text prompts or multi-concept attribute binding challenges. Previous U-Net-based methods primarily optimized the latent space, but their direct application to Transformer-based architectures has shown limited effectiveness. Our method addresses these challenges by directly optimizing cross-attention maps during the generation process. Specifically, we introduce Self-Coherence Guidance, a method that dynamically refines attention maps using masks derived from previous denoising steps, ensuring precise alignment without additional training. To validate our approach, we constructed more challenging benchmarks for evaluating coarse-grained attribute binding, fine-grained attribute binding, and style binding. Experimental results demonstrate the superior performance of our method, significantly surpassing other state-of-the-art methods across all evaluated tasks. Our code is available at https://scg-diffusion.github.io/scg-diffusion.
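The abstract describes refining cross-attention maps with masks derived from earlier denoising steps. The following is only a toy NumPy sketch of that general idea, not the authors' implementation: the function names, the thresholding rule, the `strength` multiplier, and the choice to boost an attribute token inside its object's mask are all assumptions made for illustration.

```python
import numpy as np

def mask_from_previous(prev_map, threshold=0.5):
    """Hypothetical helper: binary mask over image patches.

    Min-max normalizes one token's attention map from the previous
    denoising step and keeps patches at or above `threshold`.
    """
    lo, hi = prev_map.min(), prev_map.max()
    norm = (prev_map - lo) / (hi - lo + 1e-8)
    return norm >= threshold

def refine_attention(curr_attn, prev_attn, object_idx, attribute_idx,
                     strength=2.0, threshold=0.5):
    """Hypothetical refinement step (an assumption, not the paper's rule).

    curr_attn, prev_attn: arrays of shape (num_patches, num_tokens) whose
    rows sum to 1. The attribute token's attention is scaled up on the
    patches where the object token was strong in the previous step, then
    each row is renormalized.
    """
    mask = mask_from_previous(prev_attn[:, object_idx], threshold)
    refined = curr_attn.copy()
    refined[mask, attribute_idx] *= strength
    refined /= refined.sum(axis=1, keepdims=True)
    return refined, mask

# Toy example: 4 patches, 3 text tokens (token 1 = object, token 2 = attribute).
prev_attn = np.array([[0.1, 0.8, 0.1],
                      [0.1, 0.7, 0.2],
                      [0.6, 0.1, 0.3],
                      [0.5, 0.2, 0.3]])
curr_attn = np.full((4, 3), 1.0 / 3.0)
refined, mask = refine_attention(curr_attn, prev_attn, object_idx=1, attribute_idx=2)
```

In this toy setup the attribute token gains attention only on the patches the object token previously occupied, while the remaining rows are left unchanged. A real Transformer-based diffusion model would apply such a step inside its attention layers at each denoising iteration, and the actual update in the paper may differ substantially.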

