Supervision Complexity and its Role in Knowledge Distillation
This work gives knowledge distillation a theoretical foundation, filling a gap in understanding for machine learning researchers and practitioners, though it is incremental in that it builds on existing distillation methods.
The authors address the question of why knowledge distillation works by proposing a theoretical framework based on supervision complexity. The framework explains the utility of techniques such as early stopping and temperature scaling and motivates online distillation, whose efficacy the authors demonstrate on image classification benchmarks.
Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. To study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly complex supervision from teachers in different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
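
To make temperature scaling concrete, the following is a minimal NumPy sketch of a standard temperature-softened distillation loss. It is not taken from the paper: the function names `softmax_with_temperature` and `distillation_loss`, the use of KL divergence, and the toy logits are all illustrative assumptions. It only shows how raising the temperature flattens the teacher's predicted distribution before the student is trained to match it.

```python
import numpy as np


def softmax_with_temperature(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature (hypothetical helper)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Subtract the row maximum for numerical stability.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over temperature-softened distributions.

    Assumption: KL divergence is used as the matching loss; other choices
    (e.g. cross-entropy or squared error on logits) are also common.
    """
    teacher_probs = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    kl = np.sum(
        teacher_probs * (np.log(teacher_probs + eps) - np.log(student_probs + eps)),
        axis=-1,
    )
    return float(kl.mean())


# Toy example: one sample, three classes (values are made up).
teacher_logits = np.array([[2.0, 1.0, 0.1]])
student_logits = np.array([[0.5, 2.5, 0.3]])

sharp_targets = softmax_with_temperature(teacher_logits, temperature=1.0)
soft_targets = softmax_with_temperature(teacher_logits, temperature=4.0)
loss = distillation_loss(student_logits, teacher_logits, temperature=4.0)
```

In this sketch, `soft_targets` is flatter than `sharp_targets`, which is the sense in which a higher temperature yields smoother supervision for the student; how this relates to supervision complexity is the subject of the paper's analysis and is not reproduced here.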