Towards Training One-Step Diffusion Models Without Distillation
This work addresses the inefficiency of the two-stage training pipeline for one-step diffusion models, a practical concern for machine learning practitioners. Its contribution is incremental, since the proposed methods still depend on teacher weight initialization.
The paper asks whether one-step diffusion models can be trained without distillation from a teacher model. It shows that training methods which use no teacher score supervision can outperform most teacher-guided approaches, although they still require initialization from the teacher's weights, whose main benefit is the feature representations they provide.
Recent advances in training one-step diffusion models typically follow a two-stage pipeline: first training a teacher diffusion model and then distilling it into a one-step student model. This process often depends on both the teacher's score function for supervision and its weights for initializing the student model. In this paper, we explore whether one-step diffusion models can be trained directly without this distillation procedure. We introduce a family of new training methods that entirely forgo teacher score supervision, yet outperform most teacher-guided distillation approaches. This suggests that score supervision is not essential for effective training of one-step diffusion models. However, we find that initializing the student model with the teacher's weights remains critical. Surprisingly, the key advantage of teacher initialization is not due to better latent-to-output mappings, but rather the rich set of feature representations across different noise levels that the teacher diffusion model provides. These insights take us one step closer towards training one-step diffusion models without distillation and provide a better understanding of the roles of teacher supervision and initialization in the distillation process.
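The abstract separates two roles a teacher can play: supplying scores as a training signal, and supplying weights as a starting point. The toy sketch below illustrates only the second role, followed by one-step sampling. It is not the paper's method. `TinyDenoiser`, `init_student_from_teacher`, and `one_step_sample` are hypothetical names, and the three-parameter "network" is a stand-in chosen so the example runs with the standard library alone.

```python
import copy
import random


class TinyDenoiser:
    """Hypothetical stand-in for a diffusion network f(x_t, t)."""

    def __init__(self, seed):
        rng = random.Random(seed)
        # Three scalar parameters play the role of the network weights.
        self.weights = {
            "scale": rng.uniform(0.5, 1.5),
            "time_gain": rng.uniform(-0.1, 0.1),
            "bias": rng.uniform(-0.1, 0.1),
        }

    def __call__(self, x_t, t):
        w = self.weights
        gain = w["scale"] + w["time_gain"] * t
        return [gain * v + w["bias"] for v in x_t]


def init_student_from_teacher(teacher):
    """Copy the teacher's weights into a new student (initialization only)."""
    student = TinyDenoiser(seed=0)
    student.weights = copy.deepcopy(teacher.weights)
    return student


def one_step_sample(student, dim, t_max=1.0, seed=None):
    """Map Gaussian noise to an output with a single network evaluation."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return student(z, t_max)


teacher = TinyDenoiser(seed=42)
student = init_student_from_teacher(teacher)

# Placeholder for a training update that never queries the teacher's score;
# the actual objective is not specified in the abstract and is assumed here.
student.weights["bias"] += 0.01

sample = one_step_sample(student, dim=4, seed=1)
```

After initialization the teacher is never called again: the student starts from the teacher's parameters but is updated and sampled independently, which is the distinction between initialization and score supervision drawn above.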