Distillation Scaling Laws
This work provides practical guidelines for compute-efficient distillation, helping machine learning researchers and practitioners reduce the risks associated with large-scale distillation.
The paper addresses how to allocate compute between teacher and student models in knowledge distillation. It finds that when many students are distilled or a teacher already exists, distillation outperforms supervised learning up to a compute level that scales predictably with student size.
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
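To make the two scenarios in the abstract concrete, the following is a minimal compute-accounting sketch, not the paper's scaling law. It assumes the common rule-of-thumb estimates of roughly 6ND FLOPs for training and 2ND FLOPs for a forward pass (N parameters, D tokens), and it assumes teacher pretraining cost can be split evenly across students. All function and parameter names are hypothetical and chosen for illustration only.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training cost (forward + backward): about 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


def forward_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 * N * D FLOPs."""
    return 2.0 * n_params * n_tokens


def distillation_flops_per_student(
    student_params: float,
    distill_tokens: float,
    teacher_params: float,
    teacher_tokens: float,
    teacher_exists: bool,
    n_students: int = 1,
) -> float:
    """Estimate the compute charged to one distilled student.

    Assumptions (illustrative, not taken from the paper):
      - the student trains on `distill_tokens` tokens,
      - the teacher runs one forward pass over those same tokens,
      - if the teacher must be trained, that cost is shared
        equally among `n_students` students.
    """
    if n_students < 1:
        raise ValueError("n_students must be at least 1")

    student_cost = train_flops(student_params, distill_tokens)
    teacher_logit_cost = forward_flops(teacher_params, distill_tokens)
    if teacher_exists:
        teacher_train_cost = 0.0
    else:
        teacher_train_cost = train_flops(teacher_params, teacher_tokens) / n_students
    return student_cost + teacher_logit_cost + teacher_train_cost


if __name__ == "__main__":
    # Hypothetical sizes: a 1B-parameter student and a 7B-parameter teacher.
    common = dict(
        student_params=1e9,
        distill_tokens=20e9,
        teacher_params=7e9,
        teacher_tokens=140e9,
    )
    existing = distillation_flops_per_student(**common, teacher_exists=True)
    one_student = distillation_flops_per_student(**common, teacher_exists=False)
    ten_students = distillation_flops_per_student(
        **common, teacher_exists=False, n_students=10
    )
    print(f"existing teacher:            {existing:.3e} FLOPs")
    print(f"new teacher, one student:    {one_student:.3e} FLOPs")
    print(f"new teacher, ten students:   {ten_students:.3e} FLOPs per student")
```

Under these assumptions, teacher pretraining dominates the budget when only one student is distilled, and the per-student cost falls as that pretraining is shared across more students or avoided by reusing an existing teacher. This matches the qualitative pattern the abstract describes; the actual compute-optimal allocations come from the fitted scaling law, which this sketch does not reproduce.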