On Equivariance and Fast Sampling in Video Diffusion Models Trained with Warped Noise
The work addresses the need for temporally coherent, efficient video generation in applications such as style transfer and upsampling. It represents a strong, specific gain rather than a foundational breakthrough.
The paper tackles temporally consistent video generation by analyzing warped-noise training and showing that it induces equivariance to spatial transformations of the input noise, which improves motion alignment and sampling efficiency. The resulting EquiVDM model outperforms prior methods on benchmarks for motion alignment, temporal consistency, and perceptual quality while requiring fewer sampling steps.
Temporally consistent video-to-video generation is critical for applications such as style transfer and upsampling. In this paper, we provide a theoretical analysis of warped noise, a recently proposed technique for training video diffusion models, and show that pairing it with the standard denoising objective implicitly trains models to be equivariant to spatial transformations of the input noise. We term the resulting models EquiVDM. This equivariance enables motion in the input noise to align naturally with motion in the generated video, yielding coherent, high-fidelity outputs without the need for specialized modules or auxiliary losses. A further advantage is sampling efficiency: EquiVDM achieves comparable or superior quality in far fewer sampling steps. When distilled into one-step student models, EquiVDM preserves equivariance and delivers stronger motion controllability and fidelity than distilled non-equivariant baselines. Across benchmarks, EquiVDM consistently outperforms prior methods in motion alignment, temporal consistency, and perceptual quality, while substantially lowering sampling cost.
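To make the equivariance property concrete, the following is a minimal sketch, not the paper's model or training code. A map f is equivariant to a spatial transformation T when f(T(x)) = T(f(x)), so transforming the input noise transforms the output in the same way. The sketch makes several assumptions for illustration: T is a circular pixel shift (`warp`), the "model" is a toy circular blur followed by a pointwise nonlinearity (`toy_denoiser`), and `position_biased` is a hypothetical counterexample that breaks the property. All function names are invented for this example.

```python
import numpy as np

def warp(x, shift):
    """Assumed spatial transformation T: circular shift by (dy, dx) pixels."""
    return np.roll(x, shift=shift, axis=(0, 1))

def toy_denoiser(x):
    """Toy stand-in for a model: 3x3 circular box blur, then pointwise tanh.
    Both operations commute with circular shifts."""
    acc = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            acc += np.roll(x, shift=(dy, dx), axis=(0, 1))
    return np.tanh(acc / 9.0)

def position_biased(x):
    """Hypothetical counterexample: adds a column-dependent offset,
    so the output depends on absolute position."""
    return x + np.arange(x.shape[1], dtype=x.dtype)

def equivariance_error(f, x, shift):
    """Max absolute gap between f(T(x)) and T(f(x)); zero means f is
    equivariant to T on this input."""
    return float(np.max(np.abs(f(warp(x, shift)) - warp(f(x), shift))))

rng = np.random.default_rng(0)
noise = rng.standard_normal((16, 16))  # one "frame" of input noise

err_equivariant = equivariance_error(toy_denoiser, noise, shift=(2, 3))
err_biased = equivariance_error(position_biased, noise, shift=(2, 3))
print("toy_denoiser gap:", err_equivariant)
print("position_biased gap:", err_biased)
```

In this toy setting the first gap is zero up to floating-point error, while the second is clearly nonzero. The same kind of check, with a real warp in place of the circular shift, is one way to probe whether motion in the input noise carries over to the generated frames.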