SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez

Tsinghua

arXiv:2602.12675v1

Originality: Incremental advance

AI Analysis

This work addresses efficiency bottlenecks in video generation for AI researchers and practitioners, offering an incremental improvement over existing SLA methods.

The paper addresses two weaknesses of Sparse-Linear Attention (SLA) for diffusion models: a heuristic sparse/linear split that can be suboptimal, and a mismatch between SLA and a direct decomposition into sparse and linear attention. It proposes SLA2, which adds a learnable router and a more faithful formulation, and reports 97% attention sparsity and an 18.6x attention speedup on video diffusion models while preserving generation quality.

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
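
As a rough illustration of ideas (I) and (II) above, the following NumPy sketch routes each query-key pair to either a sparse softmax branch or a linear-attention branch and mixes the two outputs with a ratio `alpha`. It is a minimal sketch under stated assumptions, not the authors' implementation: the top-k router with projections `Wq` and `Wk`, the ReLU-based feature map, and the function names are all hypothetical choices made for the example, and the quantization-aware fine-tuning of (III) is not covered.

```python
import numpy as np

def softmax_rows(x):
    # Numerically stable row-wise softmax; -inf entries get weight 0.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def route_topk(Q, K, Wq, Wk, k):
    # Hypothetical router: score pairs with projections Wq, Wk
    # (assumed learnable) and keep the k best keys per query.
    r = (Q @ Wq) @ (K @ Wk).T
    idx = np.argpartition(-r, k - 1, axis=-1)[:, :k]
    mask = np.zeros_like(r, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask  # True = sparse branch, False = linear branch

def sparse_linear_attention(Q, K, V, mask, alpha):
    d = Q.shape[-1]
    # Sparse branch: softmax attention restricted to routed pairs.
    scores = np.where(mask, Q @ K.T / np.sqrt(d), -np.inf)
    sparse_out = softmax_rows(scores) @ V
    # Linear branch: kernel attention over the remaining pairs.
    # The ReLU feature map is an assumption for this sketch.
    phi_q = np.maximum(Q, 0) + 1e-6
    phi_k = np.maximum(K, 0) + 1e-6
    w = (phi_q @ phi_k.T) * ~mask
    linear_out = (w @ V) / np.maximum(w.sum(axis=-1, keepdims=True), 1e-12)
    # Combine the branches with the (assumed learnable) ratio alpha.
    return alpha * sparse_out + (1.0 - alpha) * linear_out

rng = np.random.default_rng(0)
N, d = 8, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
mask = route_topk(Q, K, np.eye(d), np.eye(d), k=2)
out = sparse_linear_attention(Q, K, V, mask, alpha=0.7)
sparsity = 1.0 - mask.mean()  # fraction of pairs skipped by the sparse branch
```

The sketch materializes the full N x N score matrix for readability; a practical kernel would skip masked blocks entirely, which is where any speedup would come from. As a back-of-the-envelope bound, computing only 3% of the pairs in the sparse branch caps the reduction in that branch's work at about 1 / 0.03 ≈ 33x, so the reported 18.6x attention speedup sits below that ceiling.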
