CV, Jun 12, 2024

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang

arXiv:2406.08552v2, 33.6, 107 citations

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency for users of DiT models in image and video generation. It is an incremental improvement achieved through post-training attention compression.

The paper tackles the computational bottleneck of Diffusion Transformers (DiT) in image and video generation by proposing DiTFastAttn, a post-training compression method that reduces attention FLOPs by up to 76% and achieves up to 1.8x end-to-end speedup for high-resolution (2k x 2k) image generation.
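To relate an attention FLOP reduction to an end-to-end speedup, an Amdahl-style estimate can help. The sketch below is an illustration, not a calculation from the paper. It assumes that runtime scales linearly with FLOPs and that attention takes a fixed share of total runtime; the function name and the `attn_fraction` parameter are hypothetical.

```python
def end_to_end_speedup(attn_fraction: float, attn_flop_reduction: float) -> float:
    """Amdahl-style estimate of overall speedup.

    attn_fraction: assumed share of total runtime spent in attention (0..1).
    attn_flop_reduction: fraction of attention FLOPs removed (0..1).
    Assumes runtime is proportional to FLOPs, which real kernels only approximate.
    """
    if not (0.0 <= attn_fraction <= 1.0 and 0.0 <= attn_flop_reduction < 1.0):
        raise ValueError("fractions must lie in [0, 1)")
    # Non-attention work is unchanged; attention work shrinks.
    remaining = (1.0 - attn_fraction) + attn_fraction * (1.0 - attn_flop_reduction)
    return 1.0 / remaining


if __name__ == "__main__":
    # Sweep hypothetical attention shares for a 76% attention FLOP reduction.
    for share in (0.3, 0.5, 0.7):
        print(share, round(end_to_end_speedup(share, 0.76), 2))
```

Under these assumptions, the overall gain depends strongly on how much of the runtime attention occupies, which is why the reported speedup is tied to high-resolution generation, where the quadratic attention cost dominates.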

Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
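The three techniques named in the abstract can be illustrated with a small NumPy sketch. This is a toy, single-head, 1D-token version written for this summary, not the authors' implementation. The class name, the mode strings, and the 1D window definition are assumptions, and the dense masking used here shows the numerics only and does not save any FLOPs.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def window_attention(q, k, v, window):
    """Each token attends only to tokens within `window` positions (1D toy)."""
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)  # masked entries get zero weight
    return softmax(scores) @ v


class SketchAttentionLayer:
    """Hypothetical layer holding the caches the three techniques rely on."""

    def __init__(self, window):
        self.window = window
        self.residual = None       # full output minus window output
        self.cached_output = None  # attention output of the previous step

    def step(self, q, k, v, mode):
        if mode == "full":
            out = full_attention(q, k, v)
            # Cache the long-range part that window attention misses.
            self.residual = out - window_attention(q, k, v, self.window)
        elif mode == "window_residual":
            if self.residual is None:
                raise RuntimeError("run a 'full' step first to cache the residual")
            # Window attention plus the residual shared from an earlier step.
            out = window_attention(q, k, v, self.window) + self.residual
        elif mode == "share_timestep":
            if self.cached_output is None:
                raise RuntimeError("no previous step output to share")
            # Reuse the previous step's attention output unchanged.
            out = self.cached_output
        else:
            raise ValueError(f"unknown mode: {mode}")
        self.cached_output = out
        return out


def cfg_shared_step(layer, q_cond, k_cond, v_cond, mode):
    """Sharing across CFG: compute the conditional branch once and
    reuse its attention output for the unconditional branch."""
    out_cond = layer.step(q_cond, k_cond, v_cond, mode)
    return out_cond, out_cond
```

In this sketch, a `"window_residual"` step reproduces the full-attention output exactly when its inputs match those of the cached `"full"` step, and it becomes an approximation as the inputs drift between denoising steps. How to choose the mode for each layer and step is outside the scope of the sketch.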
