LG · Jun 29, 2022

Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?

arXiv:2206.14532v1 · 54 citations · h-index: 15
Originality: Incremental advance
AI Analysis

This paper resolves a practical dilemma for machine learning practitioners working on model compression and training efficiency: it clarifies when label smoothing can be combined with knowledge distillation.

This paper tackles the contradictory findings on whether label smoothing (LS) is compatible with knowledge distillation (KD). It identifies systematic diffusion as the missing concept: this diffusion limits the benefits of distilling from an LS-trained teacher, especially at high temperatures. The paper recommends using LS-trained teachers with low-temperature transfer for better student performance.

This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019) and Shen et al. (2021b). Critically, there is no effort to understand and resolve these contradictory findings, leaving the primal question -- to smooth or not to smooth a teacher network? -- unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept which is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/
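
For readers less familiar with the two techniques named in the abstract, the following is a minimal NumPy sketch of label-smoothed targets and a temperature-scaled distillation loss. It is not the authors' implementation. The function names, the smoothing weight `alpha`, and the `T**2` scaling (a common convention in the KD literature) are assumptions made for illustration; the abstract does not state specific values.

```python
import numpy as np


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a flatter output."""
    return np.exp(log_softmax(logits, temperature))


def smooth_labels(labels, num_classes, alpha=0.1):
    """Label smoothing: mix the one-hot target with a uniform distribution.

    alpha is a hypothetical smoothing weight chosen for illustration.
    """
    one_hot = np.eye(num_classes)[np.asarray(labels)]
    return (1.0 - alpha) * one_hot + alpha / num_classes


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    The T**2 factor is a common convention, assumed here, that keeps
    gradient magnitudes comparable across temperatures.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return (temperature ** 2) * kl.mean()


if __name__ == "__main__":
    # Illustrative logits only; these are not results from the paper.
    teacher = np.array([[4.0, 1.0, 0.0]])
    student = np.array([[2.0, 1.5, 0.5]])
    print("smoothed targets:", smooth_labels([0], num_classes=3))
    for T in (1.0, 4.0):
        print(f"T={T}: teacher probs", softmax(teacher, T),
              "KD loss", kd_loss(student, teacher, T))
```

In this sketch the teacher would be trained against `smooth_labels(...)` targets, and the student would then minimize `kd_loss(...)` against that teacher's logits. The abstract's recommendation corresponds to keeping `temperature` low in the second step.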

Code Implementations: 1 repo