CV Mar 3, 2024

SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

Hongjian Liu, Qingsong Xie, TianXiang Ye, Zhijie Deng, Chen Chen, Shixiang Tang, Xueyang Fu, Haonan Lu, Zheng-jun Zha

arXiv:2403.01505v417.317 citations h-index: 74 AAAI

Originality: Incremental advance

AI Analysis

SCott addresses the slow generation speed of diffusion models for image synthesis, offering a significant speed-up while maintaining quality, though it is an incremental improvement over existing consistency distillation methods.

The paper tackles the high inference latency of diffusion models by proposing Stochastic Consistency Distillation (SCott), which cuts text-to-image generation to 2-4 sampling steps. With 2 steps it achieves an FID of 21.9 on the MSCOCO-2017 5K dataset, outperforming the 1-step InstaFlow (23.4) and the 4-step UFOGen (22.1).

The iterative sampling procedure employed by diffusion models (DMs) often leads to significant inference latency. To address this, we propose Stochastic Consistency Distillation (SCott) to enable accelerated text-to-image generation, where high-quality and diverse generations can be achieved within just 2-4 sampling steps. In contrast to vanilla consistency distillation (CD) which distills the ordinary differential equation solvers-based sampling process of a pre-trained teacher model into a student, SCott explores the possibility and validates the efficacy of integrating stochastic differential equation (SDE) solvers into CD to fully unleash the potential of the teacher. SCott is augmented with elaborate strategies to control the noise strength and sampling process of the SDE solver. An adversarial loss is further incorporated to strengthen the consistency constraints in rare sampling steps. Empirically, on the MSCOCO-2017 5K dataset with a Stable Diffusion-V1.5 teacher, SCott achieves an FID of 21.9 with 2 sampling steps, surpassing that of the 1-step InstaFlow (23.4) and the 4-step UFOGen (22.1). Moreover, SCott can yield more diverse samples than other consistency models for high-resolution image generation, with up to 16% improvement in a qualified metric.
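To make the ODE-versus-SDE distinction in the abstract concrete, here is a minimal sketch of a teacher solver step with adjustable noise strength. It is not the paper's solver: the abstract does not specify SCott's SDE solver or its noise-control strategies, so the sketch swaps in the well-known DDIM update, where a parameter `eta` interpolates between a deterministic step (`eta = 0`, the kind of ODE-style step vanilla CD distills) and a stochastic one (`eta > 0`). The names `ddim_step`, `consistency_loss`, `student`, and `target` are hypothetical, and NumPy arrays stand in for images.

```python
import numpy as np


def ddim_step(x_t, eps_pred, abar_t, abar_s, eta, rng):
    """One DDIM-style step from noise level t to a less noisy level s.

    abar_t and abar_s are cumulative signal coefficients with abar_s > abar_t.
    eta = 0 gives the deterministic (ODE-like) update; eta > 0 injects fresh
    Gaussian noise, making the update stochastic (SDE-like).
    """
    # Clean-sample estimate implied by the teacher's predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # Standard deviation of the injected noise, scaled by eta.
    sigma = (
        eta
        * np.sqrt((1.0 - abar_s) / (1.0 - abar_t))
        * np.sqrt(1.0 - abar_t / abar_s)
    )
    # Deterministic component that re-adds the predicted noise at level s.
    direction = np.sqrt(1.0 - abar_s - sigma**2) * eps_pred
    noise = sigma * rng.standard_normal(x_t.shape)
    return np.sqrt(abar_s) * x0_pred + direction + noise


def consistency_loss(student, target, x_t, t, x_s, s):
    """Squared-error consistency loss (an assumed, simplified form).

    The student's output at the noisier point (x_t, t) is pulled toward the
    target network's output at the solver's estimate (x_s, s).
    """
    return float(np.mean((student(x_t, t) - target(x_s, s)) ** 2))
```

In a distillation loop of this shape, `x_s` would come from `ddim_step` applied to the teacher's noise prediction, and the loss would then be minimized with respect to the student. Setting `eta` above zero is one simple way to control noise strength; the strategies SCott actually uses may differ.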
