AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

Yucheng Guo, Yongjian Guo, Zhong Guan, Haoran Sun, Wen Huang, Wanting Xu, Jing Long, Shuai Di, Junwu Xiong

arXiv:2605.17923

AI Analysis

For practitioners training large-scale video diffusion models, this work provides a practical optimization to improve GPU utilization and throughput.

AdaptiveLoad addresses load imbalance in video diffusion Transformer training caused by sequence length variance, reducing computational imbalance from 39% to 18.9%, improving peak VRAM utilization by 22.7%, and increasing training throughput by 27.2% on the Wan 2.1 world model.

In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes AdaptiveLoad, an integrated optimization framework consisting of two core components: (1) a dual-constraint adaptive load balancing system, which eliminates long-sequence bottlenecks by simultaneously limiting memory consumption and computational load ($B \times S^p \le M_{\text{comp}}$); (2) a fused LayerNorm-Modulate CUDA kernel, which utilizes a D-tile coalesced reduction strategy to increase throughput and alleviate memory pressure. Experimental results on the Wan 2.1 world model demonstrate that our method reduces the computational imbalance rate from 39% to 18.9%, improves peak VRAM utilization efficiency by 22.7%, and achieves an overall training throughput increase of 27.2%.
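The dual-constraint idea can be illustrated with a minimal sketch that picks a per-bucket batch size. This is not the authors' implementation. It assumes the memory constraint takes the token-count form $B \times S \le M_{\text{mem}}$ (the abstract gives only the compute constraint), that $p = 2$ models attention cost, and that `max_batch_size`, `mem_budget`, and `comp_budget` are hypothetical names chosen for illustration.

```python
def max_batch_size(seq_len, mem_budget, comp_budget, p=2.0):
    """Largest batch size B satisfying both constraints (never below 1).

    Assumed memory constraint (not stated in the abstract): B * S    <= mem_budget
    Compute constraint from the abstract:                   B * S**p <= comp_budget
    """
    by_memory = mem_budget // seq_len
    by_compute = int(comp_budget // (seq_len ** p))
    return max(1, min(by_memory, by_compute))


def compute_cost(batch_size, seq_len, p=2.0):
    """Compute proxy B * S**p for one batch."""
    return batch_size * seq_len ** p


if __name__ == "__main__":
    mem_budget = 65536      # tokens per batch (illustrative value)
    comp_budget = 2 ** 28   # cap on B * S**2 (illustrative value)
    for seq_len in (1024, 4096, 16384):
        equal_token = mem_budget // seq_len  # "equal token length" bucketing
        dual = max_batch_size(seq_len, mem_budget, comp_budget)
        print(seq_len, equal_token, compute_cost(equal_token, seq_len),
              dual, compute_cost(dual, seq_len))
```

With these illustrative budgets, equal-token bucketing gives the three buckets compute proxies that differ by a factor of 16 between the shortest and longest sequences. Adding the compute cap shrinks the batch size of the longest bucket and narrows the spread to a factor of 4.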
