Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers
This work addresses the practical deployment challenge of video generation models for users who need efficient inference, though its contribution is incremental, since it builds on existing acceleration methods.
The paper tackles the high computational demands of video diffusion transformers for text-to-video generation by introducing Astraea, a framework that achieves up to 2.4x inference speedup on a single GPU with minimal quality loss (e.g., <0.5% loss on VBench).
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their high compute demands pose a major challenge for practical deployment. While prior studies propose acceleration methods that reduce the workload at various granularities, they often rely on heuristics, which limits their applicability. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target. At its core, Astraea proposes a lightweight token selection mechanism and a memory-efficient, GPU-friendly sparse attention strategy, enabling linear savings in execution time with minimal impact on generation quality. To determine the optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget. Together, these components let Astraea achieve up to 2.4$\times$ inference speedup on a single GPU with great scalability (up to 13.2$\times$ speedup on 8 GPUs) while improving video quality by over 10~dB in the best case compared to state-of-the-art methods ($<$0.5\% loss on VBench compared to baselines).
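The idea of searching for a per-timestep token budget with an evolutionary algorithm can be illustrated with a minimal sketch. This is not Astraea's implementation: the names `repair`, `toy_quality`, and `evolve_budget`, the `sensitivity` weights, and the square-root quality proxy are all assumptions made for illustration. A real search would presumably score each candidate by generating videos and measuring their quality. Each candidate is a list of keep ratios (the fraction of tokens retained at each timestep), and the performance target is expressed as an upper bound on their mean.

```python
import random


def repair(keep, target_mean):
    """Clip keep ratios to [0, 1], then scale down if the mean exceeds the budget."""
    keep = [min(1.0, max(0.0, k)) for k in keep]
    mean = sum(keep) / len(keep)
    if mean > target_mean:
        keep = [k * target_mean / mean for k in keep]
    return keep


def toy_quality(keep, sensitivity):
    """Hypothetical quality proxy (an assumption): concave in each keep ratio."""
    return sum(s * k ** 0.5 for s, k in zip(sensitivity, keep))


def evolve_budget(sensitivity, target_mean, pop_size=32, elites=8,
                  generations=60, sigma=0.1, seed=0):
    """Elitist evolutionary search over per-timestep keep ratios."""
    rng = random.Random(seed)
    n = len(sensitivity)
    # Seed the population with the uniform allocation plus random candidates.
    population = [[target_mean] * n]
    population += [repair([rng.random() for _ in range(n)], target_mean)
                   for _ in range(pop_size - 1)]
    for _ in range(generations):
        # Keep the best candidates unchanged (elitism).
        population.sort(key=lambda c: toy_quality(c, sensitivity), reverse=True)
        parents = population[:elites]
        # Refill the population with Gaussian-mutated copies of the parents.
        children = []
        while len(parents) + len(children) < pop_size:
            parent = rng.choice(parents)
            child = [k + rng.gauss(0.0, sigma) for k in parent]
            children.append(repair(child, target_mean))
        population = parents + children
    return max(population, key=lambda c: toy_quality(c, sensitivity))


if __name__ == "__main__":
    # Hypothetical sensitivities: earlier timesteps assumed to matter more.
    sens = [4.0, 3.0, 2.0, 1.0]
    best = evolve_budget(sens, target_mean=0.5)
    print([round(k, 3) for k in best], round(toy_quality(best, sens), 3))
```

Because the uniform allocation is part of the initial population and the best candidates are carried over unchanged, the result is never worse than the uniform allocation under the proxy. The `repair` step keeps every candidate within the budget, so the search only compares feasible allocations.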