On the Generalization of Knowledge Distillation: An Information-Theoretic View

arXiv:2605.131439.2

Predicted impact: top 85% in IT · last 90 days
Originality: Incremental advance

AI Analysis

For machine learning researchers, this work advances the theoretical understanding of why knowledge distillation improves generalization, though it remains largely theoretical with no empirical validation.

The paper builds an information-theoretic framework for knowledge distillation and derives generalization bounds showing how the teacher's local flatness and the distillation divergence affect the student's generalization. In a linear Gaussian case study, it decomposes the distillation divergence into bias, variance, and rank-bottleneck costs, which the authors present as practical guidance for distillation design.

Knowledge distillation is widely used to improve generalization in practice, yet its theoretical understanding remains elusive. In the standard distillation setting, a teacher model provides soft predictions to guide the training of a student model. We model teacher and student training as coupled stochastic processes and introduce a distillation divergence, defined as the Kullback-Leibler divergence between these two stochastic kernels. Within this framework, we derive two generalization bounds for the student model relative to the teacher's generalization gap: an upper bound under a sub-Gaussian assumption via algorithmic stability, and a lower bound under a central condition with sharper dependence on the distillation divergence. We further develop a loss-sharpness-aware bound with an explicit tightness regime, showing that the teacher's local flatness can strictly tighten the bound. Additionally, in a linear Gaussian case study, the distillation divergence admits an interpretable decomposition into bias, variance, and rank-bottleneck costs, yielding practical guidance for distillation design.
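To make the idea of a KL divergence between Gaussian objects concrete, the sketch below computes the standard closed-form KL divergence between two multivariate Gaussians and splits it into a mean-mismatch term and a covariance-mismatch term. This is only the textbook identity, not the paper's bias, variance, and rank-bottleneck decomposition, which the abstract does not spell out. The function name `gaussian_kl`, the "student"/"teacher" argument labels, and the direction of the divergence are assumptions made for illustration.

```python
import numpy as np

def gaussian_kl(mu_s, cov_s, mu_t, cov_t):
    """Return KL(N(mu_s, cov_s) || N(mu_t, cov_t)) as (mean_term, cov_term).

    The "s"/"t" (student/teacher) labels and the argument order are
    illustrative assumptions, not taken from the paper.
    """
    mu_s = np.asarray(mu_s, dtype=float)
    mu_t = np.asarray(mu_t, dtype=float)
    cov_s = np.asarray(cov_s, dtype=float)
    cov_t = np.asarray(cov_t, dtype=float)
    d = mu_s.shape[0]

    # Mean-mismatch term: 0.5 * (mu_t - mu_s)^T cov_t^{-1} (mu_t - mu_s)
    diff = mu_t - mu_s
    mean_term = 0.5 * diff @ np.linalg.solve(cov_t, diff)

    # Covariance-mismatch term:
    # 0.5 * (tr(cov_t^{-1} cov_s) - d + log det cov_t - log det cov_s)
    _, logdet_s = np.linalg.slogdet(cov_s)
    _, logdet_t = np.linalg.slogdet(cov_t)
    trace_term = np.trace(np.linalg.solve(cov_t, cov_s))
    cov_term = 0.5 * (trace_term - d + logdet_t - logdet_s)

    return mean_term, cov_term


if __name__ == "__main__":
    # Hypothetical 2-D example: shifted mean, inflated covariance.
    mean_term, cov_term = gaussian_kl(
        mu_s=[0.0, 0.0], cov_s=np.eye(2),
        mu_t=[1.0, 0.0], cov_t=2.0 * np.eye(2),
    )
    print("mean term:", mean_term)
    print("covariance term:", cov_term)
    print("total KL:", mean_term + cov_term)
```

The two returned terms sum to the full KL divergence; keeping them separate shows how a mismatch in means and a mismatch in covariances contribute independently, which is the general flavor of an additive cost decomposition.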
