S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation
This work addresses efficient video generation for mobile applications and offers a practical solution for on-device streaming. Its contribution is incremental in that it builds on existing Diffusion Transformer methods.
The paper tackles the heavy computational cost of Diffusion Transformers for video generation, which hinders real-time and on-device use. It introduces S2DiT for high-fidelity streaming video generation on mobile hardware and reports quality comparable to state-of-the-art server models while streaming at over 10 FPS on an iPhone.
Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, streaming video generation on mobile hardware. S2DiT generates more tokens but remains efficient by mixing two novel efficient attention mechanisms: LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Building on these, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, these components allow S2DiT to achieve quality on par with state-of-the-art server video models while streaming at over 10 FPS on an iPhone.
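The abstract does not spell out how the budget-aware dynamic programming search works, so the following is only a minimal sketch of the general idea: choosing one block type per layer to maximize a total score under a total cost budget. The function name `budget_aware_search`, the per-layer scores, the integer costs, and the budget are all hypothetical illustrations, not values or procedures taken from the paper.

```python
from typing import Dict, List, Tuple

NEG_INF = float("-inf")


def budget_aware_search(
    scores: List[Dict[str, float]],
    costs: Dict[str, int],
    budget: int,
) -> Tuple[float, List[str]]:
    """Pick one block type per layer to maximize total score within a budget.

    scores[l][b] is an assumed quality score for block type b at layer l,
    costs[b] is an assumed integer cost for block type b.
    """
    # best[c] = best total score over the layers seen so far at exact cost c
    best = [NEG_INF] * (budget + 1)
    best[0] = 0.0
    choices = []  # choices[l][c] = (block, previous cost) used to reach cost c

    for layer_scores in scores:
        new_best = [NEG_INF] * (budget + 1)
        pick = [None] * (budget + 1)
        for c in range(budget + 1):
            if best[c] == NEG_INF:
                continue
            for block, cost in costs.items():
                nc = c + cost
                if nc > budget:
                    continue  # this choice would exceed the budget
                candidate = best[c] + layer_scores[block]
                if candidate > new_best[nc]:
                    new_best[nc] = candidate
                    pick[nc] = (block, c)
        best = new_best
        choices.append(pick)

    end = max(range(budget + 1), key=lambda c: best[c])
    if best[end] == NEG_INF:
        raise ValueError("no assignment fits within the budget")

    # Walk the recorded choices backwards to recover the per-layer plan.
    plan = []
    c = end
    for pick in reversed(choices):
        block, c = pick[c]
        plan.append(block)
    plan.reverse()
    return best[end], plan


# Toy inputs (made up for illustration): "ssa" is costlier but scores
# higher in the two middle layers; "lcha" is cheap everywhere.
toy_costs = {"lcha": 1, "ssa": 3}
toy_scores = [
    {"lcha": 1.0, "ssa": 1.2},
    {"lcha": 1.0, "ssa": 2.0},
    {"lcha": 1.0, "ssa": 2.0},
    {"lcha": 1.0, "ssa": 1.2},
]
best_score, best_plan = budget_aware_search(toy_scores, toy_costs, budget=8)
print(best_score, best_plan)  # 6.0 ['lcha', 'ssa', 'ssa', 'lcha']
```

With these toy numbers the budget admits at most two of the costlier blocks, and the search places them where the assumed scores are highest. In practice the scores would have to come from some measured quality proxy and the costs from on-device latency or FLOP estimates, which this summary does not specify.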