Generative Distribution Distillation
This work addresses efficient knowledge transfer in machine learning across balanced, imbalanced, and unlabeled data settings. It is an incremental advance with specific, measured gains.
The paper formulates knowledge distillation as a conditional generative problem and proposes the Generative Distribution Distillation (GenDD) framework. In the unsupervised setting, GenDD surpasses a KL baseline by 16.29% on the ImageNet validation set. With label supervision, a ResNet-50 reaches 82.28% top-1 accuracy on ImageNet, which the paper reports as a new state of the art.
In this paper, we formulate knowledge distillation (KD) as a conditional generative problem and propose the \textit{Generative Distribution Distillation (GenDD)} framework. A naive \textit{GenDD} baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address the first, we introduce a \textit{Split Tokenization} strategy, which achieves stable and effective unsupervised KD. To address the second, we develop the \textit{Distribution Contraction} technique, which integrates label supervision into the reconstruction objective. We prove that \textit{GenDD} with \textit{Distribution Contraction} serves as a gradient-level surrogate for multi-task learning, enabling efficient supervised training without an explicit classification loss on image representations obtained through multi-step sampling. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that \textit{GenDD} performs competitively in the unsupervised setting, significantly surpassing the KL baseline by \textbf{16.29\%} on the ImageNet validation set. With label supervision, our ResNet-50 achieves \textbf{82.28\%} top-1 accuracy on ImageNet after 600 epochs of training, establishing a new state of the art.
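To make the two reference points in the abstract concrete, the following is a minimal sketch in plain Python. The first function is the standard KL-divergence distillation loss on temperature-softened class distributions, the usual form of the "KL baseline" named above. The second is only one plausible reading of \textit{Split Tokenization}: the abstract does not describe the mechanism, so cutting a high-dimensional representation into equal low-dimensional chunks is an assumption made for illustration. All function names and the `token_dim` parameter are hypothetical and do not come from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) on temperature-softened distributions.

    This is the classic KD objective; the commonly used T^2 scaling
    factor is omitted here for brevity.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def split_into_tokens(feature, token_dim):
    """ASSUMPTION: split a D-dim feature into D // token_dim chunks.

    Illustrates trading one high-dimensional target for several
    low-dimensional ones; not the paper's confirmed procedure.
    """
    if len(feature) % token_dim != 0:
        raise ValueError("feature length must be divisible by token_dim")
    return [feature[i:i + token_dim] for i in range(0, len(feature), token_dim)]


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    student = [3.0, 2.0, 1.0]
    print("KL loss:", kl_distillation_loss(teacher, student, temperature=2.0))
    print("tokens:", split_into_tokens(list(range(8)), token_dim=4))
```

Under this reading, a generative student would model each low-dimensional token conditioned on the input image rather than the full representation at once, which is how a split could ease the high-dimensional optimization problem the abstract mentions.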