Why distillation helps: a statistical perspective
This work offers researchers and practitioners theoretical insight into knowledge distillation, a widely used but poorly understood technique in machine learning. Its contribution is incremental in that it builds on existing distillation methods.
The paper asks why knowledge distillation improves student model performance and answers from a statistical perspective. It identifies a bias-variance tradeoff in the student's objective, which quantifies how approximate class-probability estimates from a teacher aid learning.
Knowledge distillation is a technique for improving the performance of a simple "student" model by replacing its one-hot training labels with a distribution over labels obtained from a complex "teacher" model. While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help? In this paper, we present a statistical perspective on distillation which addresses this question, and provides a novel connection to extreme multiclass retrieval techniques. Our core observation is that the teacher seeks to estimate the underlying (Bayes) class-probability function. Building on this, we establish a fundamental bias-variance tradeoff in the student's objective: this quantifies how approximate knowledge of these class-probabilities can significantly aid learning. Finally, we show how distillation complements existing negative mining techniques for extreme multiclass retrieval, and propose a unified objective which combines these ideas.
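As a rough illustration of the core observation, the sketch below compares the student's cross-entropy loss under one-hot labels with its loss when the targets are class-probabilities, for a single input. It is a toy example under stated assumptions, not the paper's derivation: the three-class probabilities, the student's prediction, and the helper names `cross_entropy` and `one_hot` are hypothetical, and the teacher is assumed to return the Bayes class-probabilities exactly. Under that assumption the one-hot loss has the same mean as the distilled loss but nonzero variance across label draws, which is the variance side of the tradeoff. An imperfect teacher would add bias that this sketch does not model.

```python
import math


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(q) for t, q in zip(target, predicted) if t > 0)


def one_hot(label, num_classes):
    """One-hot encoding of an integer label as a list of floats."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


# Hypothetical values for a single input x (assumptions, not from the paper).
bayes_probs = [0.7, 0.2, 0.1]    # assumed true P(y | x), returned by an ideal teacher
student_probs = [0.6, 0.3, 0.1]  # assumed current student prediction for x
num_classes = len(bayes_probs)

# One-hot training: the loss depends on which label happens to be sampled.
per_label_losses = [
    cross_entropy(one_hot(y, num_classes), student_probs)
    for y in range(num_classes)
]
one_hot_mean = sum(p * loss for p, loss in zip(bayes_probs, per_label_losses))
one_hot_var = sum(
    p * (loss - one_hot_mean) ** 2
    for p, loss in zip(bayes_probs, per_label_losses)
)

# Distilled training: the target is the full distribution, so the loss for x
# does not depend on the sampled label.
distilled_loss = cross_entropy(bayes_probs, student_probs)

print(f"one-hot loss: mean={one_hot_mean:.4f}, variance={one_hot_var:.4f}")
print(f"distilled loss (Bayes targets): {distilled_loss:.4f}, variance=0")
```

The two means agree because the expected one-hot loss over labels drawn from the Bayes distribution is exactly the cross-entropy against that distribution. The distilled objective removes only the label-sampling noise.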