Taming Diffusion Transformer for Efficient Mobile Video Generation in Seconds
This work addresses efficient, high-quality video generation on resource-constrained mobile platforms. It is an incremental improvement with practical implications for on-device deployment.
The paper tackles the high computational cost of Diffusion Transformers for video generation on mobile devices with three optimizations: a highly compressed VAE, KD-guided sensitivity-aware tri-level pruning, and adversarial step distillation. The resulting model generates approximately 15 frames per second on an iPhone 16 Pro Max.
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices such as smartphones, where generating video at usable speeds is harder still. In this work, we propose a series of novel optimizations to significantly accelerate video generation and enable practical deployment on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a sensitivity-aware tri-level pruning strategy, guided by knowledge distillation (KD), to shrink the model to a size that suits mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve a generation speed of approximately 15 frames per second (FPS) on an iPhone 16 Pro Max, demonstrating the feasibility of efficient, high-quality video generation on mobile devices.
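To illustrate why the three optimizations compound, the following is a rough back-of-envelope cost model, not the paper's own analysis. It assumes that sampling cost is the number of denoising steps times a per-step cost, and that per-step cost scales with the fraction of parameters kept after pruning and with the latent token count raised to some exponent between 1 (linear layers dominate) and 2 (attention dominates). The function name `relative_cost`, its parameters, and every number in the usage lines are hypothetical and are not taken from the paper, except the four-step setting.

```python
def relative_cost(token_ratio, param_ratio, steps, baseline_steps, attn_exponent=1.0):
    """Estimate sampling cost relative to an unoptimized baseline (1.0 = same cost).

    token_ratio:    latent tokens after VAE compression / baseline tokens (assumed)
    param_ratio:    parameters kept after pruning / baseline parameters (assumed)
    steps:          denoising steps after step distillation
    baseline_steps: denoising steps of the baseline sampler (assumed)
    attn_exponent:  1.0 if cost is linear in tokens, 2.0 if attention dominates
    """
    if not 1.0 <= attn_exponent <= 2.0:
        raise ValueError("attn_exponent is expected to lie in [1, 2]")
    step_factor = steps / baseline_steps
    return step_factor * param_ratio * token_ratio ** attn_exponent


if __name__ == "__main__":
    # Hypothetical setting: 4x fewer tokens, half the parameters, 40 -> 4 steps.
    for exponent in (1.0, 2.0):
        cost = relative_cost(0.25, 0.5, steps=4, baseline_steps=40, attn_exponent=exponent)
        print(f"attn_exponent={exponent}: relative cost {cost:.5f}")
```

Because the three factors multiply, a moderate reduction from each optimization yields a large overall reduction, which is the intuition behind combining them rather than relying on any single one.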