CV · Jan 19

S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

arXiv:2601.12719v1
Originality: Incremental advance
AI Analysis

This work targets efficient video generation on mobile devices and offers a practical approach to on-device streaming. The advance is incremental: it builds on existing Diffusion Transformer methods.

The paper tackles the heavy computational cost of Diffusion Transformers for video generation, which makes real-time or on-device use infeasible. It introduces S2DiT for high-fidelity streaming video generation on mobile hardware and reports quality comparable to state-of-the-art server models while streaming at over 10 FPS on an iPhone.

Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.
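
The abstract mentions a budget-aware dynamic programming search over attention types but does not spell out the procedure. As a rough illustration of the general idea only, the sketch below treats the problem as a multiple-choice knapsack: each layer picks exactly one block type, and the total cost must stay within a budget. The function name `search_layout`, the option labels, and all cost and score numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Option = Tuple[str, int, float]  # (block name, integer cost, quality score)


def search_layout(options: List[List[Option]], budget: int) -> Tuple[float, List[str]]:
    """Choose one option per layer to maximize total score with total cost <= budget.

    Returns (best_score, chosen_names); best_score is -inf if no layout fits.
    """
    neg_inf = float("-inf")
    # best[c] holds the best (score, picks) reaching exact total cost c so far.
    best: List[Tuple[float, List[str]]] = [(neg_inf, [])] * (budget + 1)
    best[0] = (0.0, [])
    for layer in options:
        nxt: List[Tuple[float, List[str]]] = [(neg_inf, [])] * (budget + 1)
        for cost_so_far, (score, picks) in enumerate(best):
            if score == neg_inf:
                continue  # this cost level is unreachable
            for name, cost, gain in layer:
                new_cost = cost_so_far + cost
                if new_cost <= budget and score + gain > nxt[new_cost][0]:
                    nxt[new_cost] = (score + gain, picks + [name])
        best = nxt
    return max(best, key=lambda entry: entry[0])


# Hypothetical per-layer menu: labels, costs, and scores are made up for illustration.
menu: List[Option] = [("full", 4, 10.0), ("ssa", 2, 7.0), ("lcha", 1, 5.0)]
score, layout = search_layout([menu, menu, menu], budget=6)
print(score, layout)
```

With a per-layer menu that differs by depth, the same search can return mixed layouts; the uniform menu here is only the simplest case to check by hand.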

