SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment
This work addresses slow video synthesis in AI and multimedia applications, offering an incremental improvement to the distillation techniques used to speed up generation.
The paper tackles the computational overhead of multi-step video generation models by proposing SwiftVideo, a unified distillation framework that combines trajectory-preserving and distribution-matching strategies. It reports high-quality video generation with significantly fewer inference steps and better results than existing methods on the OpenVid-1M benchmark.
Diffusion-based and flow-based models have achieved significant progress in video synthesis but require many iterative sampling steps, which incurs substantial computational overhead. Many distillation methods based solely on trajectory preservation or distribution matching have been developed to accelerate video generation models, but these approaches often suffer from performance breakdown or increased artifacts under few-step settings. To address these limitations, we propose \textbf{\emph{SwiftVideo}}, a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our approach first introduces continuous-time consistency distillation to ensure precise preservation of ODE trajectories. We then propose a dual-perspective alignment that comprises distribution alignment between synthetic and real data and trajectory alignment across different inference steps. Our method maintains high-quality video generation while substantially reducing the number of inference steps. Quantitative evaluations on the OpenVid-1M benchmark demonstrate that our method significantly outperforms existing approaches in few-step video generation.
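
To make the trajectory-preserving idea concrete, the following toy sketch contrasts multi-step ODE integration with a one-evaluation consistency map. It is an illustration only and not the SwiftVideo method: the scalar ODE dx/dt = -x stands in for a learned velocity field, and the names `velocity`, `euler_sample`, and `consistency_fn` are hypothetical. In this toy setting the consistency map is known in closed form, whereas a distilled student network would have to learn it.

```python
import math


def velocity(x, t):
    """Toy ODE right-hand side dx/dt = -x (assumed stand-in for a learned velocity field)."""
    return -x


def euler_sample(x_start, num_steps):
    """Integrate the toy ODE from t=0 to t=1 using `num_steps` explicit Euler steps."""
    x = x_start
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x


def consistency_fn(x, t):
    """Map any point (x, t) on a trajectory directly to that trajectory's endpoint at t=1.

    For dx/dt = -x the endpoint is x * exp(-(1 - t)); this closed form exists
    only because the ODE is a toy.
    """
    return x * math.exp(-(1.0 - t))


x_start = 2.0
exact_endpoint = x_start * math.exp(-1.0)

# Few-step Euler integration drifts from the true trajectory; many steps track it closely.
err_1_step = abs(euler_sample(x_start, 1) - exact_endpoint)
err_50_steps = abs(euler_sample(x_start, 50) - exact_endpoint)

# Self-consistency: points taken from the same trajectory map to the same endpoint.
x_mid = x_start * math.exp(-0.5)  # exact state of the trajectory at t=0.5
from_start = consistency_fn(x_start, 0.0)
from_mid = consistency_fn(x_mid, 0.5)

print(f"Euler error, 1 step:   {err_1_step:.4f}")
print(f"Euler error, 50 steps: {err_50_steps:.4f}")
print(f"Consistency map from t=0.0: {from_start:.6f}")
print(f"Consistency map from t=0.5: {from_mid:.6f}")
```

The sketch shows why naive step reduction degrades samples (a single Euler step leaves the trajectory) while a map that is constant along each trajectory reaches the endpoint in one evaluation. It does not cover the distribution-matching side of the framework.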