LG, AI, CL, ML | Feb 12, 2025

Distillation Scaling Laws

Apple, Berkeley
arXiv:2502.08606v2 | 45 citations | h-index: 14 | ICML
Originality: Incremental advance
AI Analysis

This work provides practical guidelines for compute-efficient distillation, helping machine learning researchers and practitioners reduce the risk of misallocating compute in large-scale distillation.

The paper tackles the problem of allocating compute between teacher and student models in knowledge distillation. It finds that distillation outperforms supervised learning when many students are distilled or a teacher already exists, but only up to a compute level that scales predictably with student size. When a single student is distilled and the teacher must also be trained, supervised learning is generally preferable.

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
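The abstract's two scenarios differ mainly in how the teacher's cost is charged to each student. The sketch below illustrates that accounting only; it is not the paper's scaling law and does not predict student loss. It assumes the common rules of thumb of roughly 6 FLOPs per parameter per token for training and 2 for a forward pass, and it assumes teacher outputs are recomputed for every student. All function and parameter names are hypothetical.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training cost using the 6 * N * D rule of thumb (assumption)."""
    return 6.0 * params * tokens


def forward_flops(params: float, tokens: float) -> float:
    """Approximate forward-pass-only cost using the 2 * N * D rule of thumb (assumption)."""
    return 2.0 * params * tokens


def distillation_flops_per_student(
    student_params: float,
    distill_tokens: float,
    teacher_params: float,
    teacher_tokens: float,
    num_students: int,
    teacher_exists: bool,
) -> float:
    """Compute charged to one student under this illustrative accounting."""
    # Training the student on the distillation tokens.
    student_cost = training_flops(student_params, distill_tokens)
    # Teacher forward passes that produce the distillation targets
    # (assumed to be recomputed per student, not cached).
    teacher_inference = forward_flops(teacher_params, distill_tokens)
    # Teacher pretraining is free if a teacher already exists; otherwise it is
    # split evenly across all students distilled from that teacher.
    teacher_pretraining = (
        0.0 if teacher_exists else training_flops(teacher_params, teacher_tokens)
    )
    return student_cost + teacher_inference + teacher_pretraining / num_students
```

Under this accounting, the per-student overhead shrinks as `num_students` grows and the pretraining term drops out entirely when `teacher_exists` is true, which is consistent with the two settings where the abstract reports distillation as favorable.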
