CV · Apr 3, 2024

Faster Diffusion via Temporal Attention Decomposition

Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber

arXiv:2404.02747v328.751 citations · h-index: 12 · Has Code

Originality: Incremental advance

AI Analysis

This work targets inference speed in text-to-image diffusion models. It is an incremental advance: the method reuses existing attention mechanisms and requires no retraining.

The paper addresses the inference cost of attention in text-conditional diffusion models. It observes that cross-attention outputs converge after a few inference steps and builds on this observation a training-free method, TGATE, that caches and reuses attention outputs. In the reported experiments, TGATE accelerates various existing models by 10%-50%.

We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. Self-attention, by contrast, plays a minor role initially but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show that, when applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
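
The caching idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that). It only shows the general pattern of computing an attention output for the first few steps and reusing the last result afterwards. The names `GatedAttentionCache`, `toy_cross_attention`, `run_inference`, and the `gate_step` parameter are hypothetical, and the toy attention function is a stand-in that converges to a fixed point rather than a real cross-attention layer.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


class GatedAttentionCache:
    """Toy wrapper: evaluate an attention function before `gate_step`,
    then reuse the most recent output for all later steps."""

    def __init__(self, attention_fn: Callable[[Vector, int], Vector], gate_step: int) -> None:
        self.attention_fn = attention_fn  # stand-in for an expensive attention call
        self.gate_step = gate_step        # hypothetical: first step that reuses the cache
        self.cached: Optional[Vector] = None
        self.calls = 0                    # number of real attention evaluations

    def __call__(self, hidden: Vector, step: int) -> Vector:
        # After the gate step, skip the computation and return the cached output.
        if step >= self.gate_step and self.cached is not None:
            return self.cached
        out = self.attention_fn(hidden, step)
        self.calls += 1
        self.cached = out
        return out


def toy_cross_attention(hidden: Vector, step: int) -> Vector:
    # Assumption for illustration: the output approaches a fixed point (1.0)
    # as the step index grows, mimicking the convergence described above.
    return [1.0 - 0.5 ** (step + 1) for _ in hidden]


def run_inference(num_steps: int, gate_step: int) -> Tuple[List[Vector], int]:
    attn = GatedAttentionCache(toy_cross_attention, gate_step)
    hidden = [0.0, 0.0, 0.0]
    outputs = [attn(hidden, step) for step in range(num_steps)]
    return outputs, attn.calls


if __name__ == "__main__":
    outputs, calls = run_inference(num_steps=25, gate_step=10)
    print(f"attention evaluated {calls} times over {len(outputs)} steps")
```

In this sketch, a 25-step run with `gate_step=10` evaluates the attention function only for steps 0 through 9. How the gate step is chosen, and exactly which outputs are cached, are details of the actual method that this toy does not attempt to reproduce.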
