CV | Apr 3, 2024

Faster Diffusion via Temporal Attention Decomposition

arXiv:2404.02747v3 | 50 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

This work speeds up inference for text-to-image diffusion models. The advance is incremental: it builds on the models' existing attention mechanisms and leaves core training unchanged.

The paper tackles the inefficiency of attention mechanisms in text-conditional diffusion models during inference. It observes that cross-attention outputs converge early in the inference process and derives from this a training-free method, TGATE, that caches and reuses attention outputs. Experimental results show that TGATE accelerates various models by 10%-50%.

We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. Self-attention, by contrast, plays a minor role initially but becomes crucial in the second phase. These findings yield a simple, training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
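The caching idea described in the abstract can be illustrated with a minimal sketch. It is not the released TGATE code: the class name `GatedAttention`, the parameter `gate_step`, and the toy attention function are all hypothetical stand-ins, and the real method operates on attention layers inside a denoising network. The sketch only shows the control flow: evaluate the attention function during the early steps, then reuse the last output for every later step.

```python
from typing import Callable, List, Optional

Vector = List[float]


class GatedAttention:
    """Hypothetical wrapper: evaluates attn_fn before gate_step, then reuses the cached output."""

    def __init__(self, attn_fn: Callable[[Vector], Vector], gate_step: int) -> None:
        self.attn_fn = attn_fn
        self.gate_step = gate_step
        self.cache: Optional[Vector] = None
        self.calls = 0  # how many times attn_fn was actually evaluated

    def __call__(self, hidden: Vector, step: int) -> Vector:
        # Recompute while still in the early phase (or if nothing is cached yet);
        # otherwise return the output stored at the last computed step.
        if step < self.gate_step or self.cache is None:
            self.cache = self.attn_fn(hidden)
            self.calls += 1
        return self.cache


def toy_cross_attention(hidden: Vector) -> Vector:
    # Toy stand-in (an assumption, not real attention): a contraction
    # whose iterates converge to the fixed point 2.0.
    return [0.5 * h + 1.0 for h in hidden]


def run(num_steps: int, gate_step: int):
    attn = GatedAttention(toy_cross_attention, gate_step)
    hidden = [0.0, 4.0]
    for step in range(num_steps):
        hidden = attn(hidden, step)  # stand-in for one denoising update
    return hidden, attn.calls


if __name__ == "__main__":
    print(run(num_steps=10, gate_step=4))   # gated: attention evaluated 4 times
    print(run(num_steps=10, gate_step=10))  # ungated: attention evaluated 10 times
```

In this toy setting the gated run skips six of ten attention evaluations at the cost of stopping slightly short of the fixed point, which mirrors the trade-off the abstract describes: once the output has nearly converged, recomputing it adds little.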

Code Implementations: 1 repo
