CV · Sep 1, 2025

Bidirectional Sparse Attention for Faster Video Diffusion Training

arXiv:2509.01085v3 · 11 citations · h-index: 1
Originality: Highly original
AI Analysis

The paper addresses the high training and inference costs faced by researchers and practitioners working on high-resolution, long-duration video generation. It represents a strong specific gain rather than a foundational breakthrough.

The paper tackles the computational inefficiency of video diffusion Transformer models by proposing a Bidirectional Sparse Attention framework, which reduces FLOPs by up to 20x and achieves 17.79x faster attention training while maintaining generative quality.
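As a rough illustration of where a figure like "up to 20x" can come from, the sketch below counts attention FLOPs when only a fraction of queries and a fraction of key-value tokens are kept. The sequence length, head dimension, and keep ratios (`rho_q`, `rho_kv`) are hypothetical values chosen for the arithmetic, not numbers reported by the paper.

```python
import math

def attention_flops(n_q, n_kv, d):
    # Q @ K^T costs about 2 * n_q * n_kv * d FLOPs; weights @ V costs the same again.
    return 4 * n_q * n_kv * d

# Hypothetical sizes and keep ratios (assumptions, not taken from the paper).
n, d = 32_768, 128
rho_q, rho_kv = 0.5, 0.1

full = attention_flops(n, n, d)
sparse = attention_flops(rho_q * n, rho_kv * n, d)

# The reduction factor is 1 / (rho_q * rho_kv), independent of n and d.
reduction = full / sparse
print(f"FLOP reduction: {reduction:.1f}x")
```

Because the cost is a product of the two token counts, sparsifying both sides compounds: halving the queries and keeping a tenth of the key-value tokens gives roughly a twentyfold reduction in this simplified count.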

Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. The inefficiency of full attention stems from two key challenges: excessive computation, given the inherent sparsity of Queries and Key-Value pairs, and redundant computation, as fixed sparse patterns fail to leverage DiT's dynamic attention. To overcome these limitations, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses the two challenges through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity, combined with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
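The two components described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the selection rule or the threshold, so the sketch assumes that "informative" queries are those least similar (by cosine) to their block mean, and that the "statistical dynamic threshold" is the per-row mean plus `alpha` standard deviations of block-level scores. All function names and parameters (`block`, `keep`, `alpha`) are hypothetical. The code computes the masked result densely for clarity, so it shows the logic rather than any speedup.

```python
import numpy as np

def block_mean(x, block):
    # Mean-pool tokens into contiguous blocks: (n, d) -> (n // block, d).
    n, d = x.shape
    return x.reshape(n // block, block, d).mean(axis=1)

def select_queries(q, block, keep):
    # Assumption: within each block, tokens close to the block mean are redundant,
    # so keep the `keep` tokens with the lowest cosine similarity to that mean.
    n, d = q.shape
    qb = q.reshape(n // block, block, d)
    center = qb.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(qb, axis=-1) * np.linalg.norm(center, axis=-1)
    cos = (qb * center).sum(axis=-1) / (norms + 1e-8)
    idx = np.argsort(cos, axis=1)[:, :keep]          # lowest similarity first
    offsets = np.arange(n // block)[:, None] * block
    return np.sort((idx + offsets).ravel())          # global token indices

def kv_block_mask(q, k, block, alpha=0.5):
    # Assumption: threshold = mean + alpha * std of each query block's scores.
    qb, kb = block_mean(q, block), block_mean(k, block)
    scores = qb @ kb.T / np.sqrt(q.shape[1])
    thresh = scores.mean(axis=1, keepdims=True) + alpha * scores.std(axis=1, keepdims=True)
    mask = scores >= thresh
    # Always keep the best KV block so no query block ends up with an empty set.
    mask[np.arange(len(qb)), scores.argmax(axis=1)] = True
    return mask

def sparse_attention(q, k, v, block, keep, alpha=0.5):
    q_idx = select_queries(q, block, keep)
    mask_b = kv_block_mask(q, k, block, alpha)
    # Expand the block mask to token level for the selected queries only.
    token_mask = np.repeat(mask_b[q_idx // block], block, axis=1)
    logits = q[q_idx] @ k.T / np.sqrt(q.shape[1])
    logits = np.where(token_mask, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return q_idx, weights @ v
```

Calling `sparse_attention(q, k, v, block=8, keep=4)` on arrays of shape `(64, 16)` returns the indices of the 32 retained queries and their attention outputs. A practical kernel would skip the masked blocks entirely instead of computing and discarding them, which is where the actual savings would come from.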

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
