LG | AI | May 29

Rethinking the Role of Temperature in Large Language Model Distillation

arXiv:2606.00306
AI Analysis

For practitioners of LLM distillation, this work corrects a widespread misconception about the relative merits of FKL and RKL by highlighting the critical role of temperature, enabling better distillation performance with simple KL-based methods.

The paper shows that temperature scaling in LLM distillation reverses the performance ranking of forward KL (FKL) and reverse KL (RKL) divergences: while RKL outperforms FKL at temperature 1, FKL surpasses RKL at higher temperatures across instruction-following benchmarks, and temperature improves a broader family of distillation objectives.

Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language model (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $τ$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $τ$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $τ=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.
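To make the quantities in the abstract concrete, here is a minimal sketch of temperature-scaled FKL and RKL for a single token position, in plain Python. It is illustrative and not the paper's implementation. It assumes the common convention of dividing both teacher and student logits by $τ$ before the softmax; the listing does not say whether the paper does this or whether it rescales the loss (for example by $τ^2$). The function names and the example logits are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax: softmax(z / tau), computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def forward_kl(teacher_logits, student_logits, tau=1.0):
    """FKL: KL(teacher || student), both softened with the same tau (assumption)."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return kl(p, q)

def reverse_kl(teacher_logits, student_logits, tau=1.0):
    """RKL: KL(student || teacher), both softened with the same tau (assumption)."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return kl(q, p)

def non_dominant_mass(logits, tau=1.0):
    """Probability mass that the softened distribution puts outside its top token."""
    return 1.0 - max(softmax(logits, tau))

# Hypothetical logits over a 4-token vocabulary, for illustration only.
teacher_logits = [6.0, 2.0, 1.0, 0.5]
student_logits = [3.0, 2.5, 0.0, -1.0]

for tau in (1.0, 2.0, 4.0):
    print(
        f"tau={tau}: "
        f"teacher non-dominant mass={non_dominant_mass(teacher_logits, tau):.4f}, "
        f"FKL={forward_kl(teacher_logits, student_logits, tau):.4f}, "
        f"RKL={reverse_kl(teacher_logits, student_logits, tau):.4f}"
    )
```

Raising `tau` moves teacher probability mass from the top token onto the remaining tokens, which is the "softening" the abstract refers to. The sketch only shows how the loss values are computed; it does not reproduce the gradient analysis or the benchmark ranking reported in the paper.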

