CV, Jan 23

SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Tsinghua
arXiv:2601.16515v1, 4 citations, h-index: 26
Originality: Incremental advance
AI Analysis

For AI researchers and practitioners working on video generation, this paper addresses the attention compute bottleneck with an efficient tuning method. It is an incremental improvement over existing sparse attention techniques.

The paper tackles the high inference latency of video diffusion transformers, which stems from the quadratic complexity of full attention. It proposes SALAD, which reaches 90% sparsity and a 1.72x inference speedup while keeping generation quality comparable to full attention.

Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, the long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed. Training-free sparse attention is constrained by limited sparsity and thus offers modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, introducing a lightweight linear attention branch in parallel with the sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and 1.72x inference speedup, while maintaining generation quality comparable to the full attention baseline. Moreover, our finetuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
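The two-branch design described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the sparsity pattern, the linear attention feature map, or the exact form of the gate. The sketch assumes a banded local-window mask, an `elu(x) + 1` feature map, and a per-token sigmoid gate that scales the linear branch before it is added to the sparse branch. All function and parameter names (`sparse_attention`, `linear_attention`, `gated_parallel_attention`, `w_gate`, `b_gate`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, k, v, mask):
    # Softmax attention restricted to positions where mask is True.
    # Computed densely here for clarity; a real kernel would skip masked blocks.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with cost linear in sequence length.
    # Feature map elu(x) + 1 is an assumption, not taken from the paper.
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                  # (d, d_v) summary of all keys and values
    norm = qf @ kf.sum(axis=0)     # (n,) per-query normalizer
    return (qf @ kv) / (norm[:, None] + eps)

def gated_parallel_attention(x, q, k, v, mask, w_gate, b_gate):
    # Input-dependent gate: one sigmoid value per token (assumed form).
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))   # (n, 1)
    return sparse_attention(q, k, v, mask) + gate * linear_attention(q, k, v)

# Toy setup: 64 tokens, 16 channels, random projections.
rng = np.random.default_rng(0)
n, d = 64, 16
x = rng.standard_normal((n, d))
q, k, v = (x @ rng.standard_normal((d, d)) for _ in range(3))

# Banded mask: each token attends to neighbours within 3 positions.
idx = np.arange(n)
mask = np.abs(idx[:, None] - idx[None, :]) <= 3
sparsity = 1.0 - mask.mean()       # fraction of query-key pairs skipped

w_gate = rng.standard_normal((d, 1)) * 0.1
out = gated_parallel_attention(x, q, k, v, mask, w_gate, 0.0)
```

In this sketch the sparse branch keeps exact softmax attention over a small neighbourhood, while the linear branch contributes a cheap global summary. When the gate is driven toward zero the output falls back to plain sparse attention, which is one plausible reason a gated linear branch can be added with little fine-tuning.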

