Diversity-Aware Reverse Kullback-Leibler Divergence for Large Language Model Distillation
This work addresses a structural limitation of RKL-based distillation for large language models. The problem matters for efficient deployment, but the contribution is incremental because it builds on existing RKL methods.
The paper tackles the problem that reverse Kullback-Leibler (RKL) divergence causes overconfident predictions and reduced output diversity in large language model distillation. It proposes Diversity-aware RKL (DRKL) to address these issues and reports better performance and a superior fidelity-diversity trade-off in experiments.
Reverse Kullback-Leibler (RKL) divergence has recently emerged as the preferred objective for large language model (LLM) distillation, consistently outperforming forward KL (FKL), particularly in regimes with large vocabularies and significant teacher-student capacity mismatch, where RKL focuses learning on dominant modes rather than enforcing dense alignment. However, RKL introduces a structural limitation that drives the student toward overconfident predictions. We first provide an analysis of RKL by decomposing its gradients into target and non-target components, and show that non-target gradients consistently push the target logit upward even when the student already matches the teacher, thereby reducing output diversity. In addition, RKL provides weak supervision over non-target classes, leading to poor tail alignment. To address these issues, we propose Diversity-aware RKL (DRKL), which removes this gradient effect and strengthens non-target supervision while preserving the optimization benefits of RKL. Extensive experiments across datasets and model families demonstrate that DRKL consistently outperforms FKL, RKL, and other state-of-the-art distillation objectives, achieving better performance and a superior fidelity-diversity trade-off.
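To make the two baseline objectives concrete, the following is a minimal sketch of FKL and RKL over a single next-token distribution, together with the standard closed-form gradient of RKL with respect to the student logits. It does not implement DRKL or the paper's target/non-target decomposition, neither of which is specified in this text. The function names and the four-token toy logits are assumptions chosen for illustration only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_kl(teacher, student):
    """FKL = KL(teacher || student): expectation is taken under the teacher."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student))

def reverse_kl(teacher, student):
    """RKL = KL(student || teacher): expectation is taken under the student."""
    return sum(q * math.log(q / p) for p, q in zip(teacher, student))

def reverse_kl_logit_grad(teacher, student_logits):
    """Gradient of RKL w.r.t. the student logits z, with q = softmax(z):
    dRKL/dz_k = q_k * (log(q_k / p_k) - RKL).
    """
    student = softmax(student_logits)
    log_ratio = [math.log(q / p) for p, q in zip(teacher, student)]
    rkl = sum(q * r for q, r in zip(student, log_ratio))
    return [q * (r - rkl) for q, r in zip(student, log_ratio)]

# Hypothetical toy example: a peaked teacher and a flatter student.
teacher = softmax([3.0, 1.0, 0.5, -1.0])
student_logits = [1.0, 0.8, 0.2, 0.0]
student = softmax(student_logits)

fkl = forward_kl(teacher, student)
rkl = reverse_kl(teacher, student)
grad = reverse_kl_logit_grad(teacher, student_logits)
```

Each gradient entry is weighted by the student probability q_k, so tokens to which the student assigns little mass contribute little signal. This is one way to read the weak non-target supervision described above, although the paper's own analysis may frame it differently.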