LG · CV · Mar 2

A Unified Revisit of Temperature in Classification-Based Knowledge Distillation

arXiv:2603.02430v2 · h-index: 4
AI Analysis

This work addresses a practical issue for machine learning practitioners who use knowledge distillation, offering incremental insights into temperature selection that may reduce tuning cost and improve student performance.

The paper tackles temperature selection in knowledge distillation, which is often done through costly grid search. It systematically studies how temperature interacts with training components such as the optimizer and teacher pretraining, with the aim of providing practical guidance.

A central idea of knowledge distillation is to expose relational structure embedded in the teacher's weights for the student to learn, which is often facilitated using a temperature parameter. Despite its widespread use, there remains limited understanding of how to select an appropriate temperature value, or of how this value depends on other training elements such as the optimizer and teacher pretraining or finetuning. In practice, temperature is commonly chosen via grid search or by adopting values from prior work; the former can be time-consuming, and the latter may lead to suboptimal student performance when training setups differ. In this work, we posit that temperature is closely linked to these training components and present a unified study that systematically examines such interactions. By analyzing these cross-connections, we identify and present common situations that have a pronounced impact on temperature selection, providing guidance for practitioners employing knowledge distillation in their work.
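For readers unfamiliar with the role of temperature, the following is a minimal sketch of the commonly used temperature-scaled softmax and distillation loss (the formulation popularised by Hinton et al.). It is not taken from the paper, whose exact loss may differ; the function names, the example logits, and the `T**2` scaling are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) between softened distributions, scaled by T^2.

    The T^2 factor is the conventional choice (an assumption here) that keeps
    gradient magnitudes comparable across temperatures.
    """
    p = softmax_with_temperature(teacher_logits, temperature)
    q = softmax_with_temperature(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Hypothetical teacher logits for a 3-class problem.
teacher_logits = [6.0, 2.0, 1.0]
for t in (1.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(t, [round(p, 3) for p in probs])
```

A higher temperature flattens the teacher distribution, so the relative probabilities of the non-target classes become more visible to the student. This is the sense in which temperature helps expose relational structure between classes.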
