CV LGOct 9, 2025

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang

Tsinghua

arXiv:2510.08431v131.458 citationsh-index: 13

Originality Highly original

AI Analysis

This enables practical acceleration of large-scale diffusion models for image and video generation, though it builds incrementally on existing consistency distillation methods.

This work scaled up continuous-time consistency distillation to large-scale text-to-image and video diffusion models by developing a FlashAttention-2 JVP kernel for training on models with over 10 billion parameters, and proposed a score-regularized consistency model (rCM) that improved visual quality while maintaining diversity. The distilled models generated high-fidelity samples in 1-4 steps, accelerating diffusion sampling by 15-50× and matching or surpassing state-of-the-art methods on quality metrics.

This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.

View on arXiv PDF

Similar