LoCa: Logit Calibration for Knowledge Distillation
This work addresses a specific, practical issue in model compression, but the contribution is incremental because it builds on existing knowledge distillation techniques.
The paper tackles mis-instruction in knowledge distillation, where a student model is misled by teacher logits whose predictions conflict with the ground-truth labels. It proposes LoCa, a logit calibration method that corrects the teacher's prediction while preserving useful dark knowledge, and reports improvements over baselines on image classification and text generation tasks.
Knowledge Distillation (KD), which aims to train a better student model by having it mimic a teacher model, plays an important role in model compression. One typical approach is to align the output logits of the two models. However, we identify a common issue, which we call mis-instruction: the student is misled when the prediction implied by the teacher logits disagrees with the ground-truth label. At the same time, the logits carry other useful dark knowledge, such as class discriminability, which is vital for distillation. In this paper, we propose a simple yet effective Logit Calibration (LoCa) method, which calibrates the teacher's logits using the ground-truth labels. The key insight is to correct the prediction (addressing mis-instruction) while preserving the useful dark knowledge. LoCa does not require any additional parameters. Empirical results on image classification and text generation tasks demonstrate that LoCa effectively improves the performance of baselines.
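The calibration idea can be illustrated with a minimal NumPy sketch. The abstract does not state the calibration formula, so the rule below is an assumption made for illustration and may differ from the authors' method. When the teacher's top class disagrees with the label, all non-target probabilities are shrunk by a common factor and the freed mass goes to the ground-truth class. This makes the label the top class while keeping the ratios among non-target classes, which is one reading of "dark knowledge". The names `calibrate_teacher_probs`, `kd_loss`, `alpha`, and `temperature` are hypothetical.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over the last axis.
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def calibrate_teacher_probs(teacher_logits, labels, alpha=0.9, temperature=1.0):
    """Illustrative calibration rule (an assumption, not taken from the abstract).

    For samples where the teacher's argmax differs from the label, scale all
    non-target probabilities by a common factor s < 1 and assign the remaining
    mass to the ground-truth class. Correct samples are left unchanged.
    alpha in (0, 1) controls how far the label ends up above the runner-up.
    """
    p = softmax(teacher_logits, temperature)
    labels = np.asarray(labels)
    rows = np.arange(p.shape[0])

    p_gt = p[rows, labels]                  # teacher probability of the label
    wrong = p.argmax(axis=-1) != labels     # mis-instructed samples

    non_target = p.copy()
    non_target[rows, labels] = 0.0
    p_max_nt = non_target.max(axis=-1)      # largest non-target probability

    # The label becomes the argmax iff s < 1 / (1 - p_gt + p_max_nt).
    s = np.where(wrong, alpha / (1.0 - p_gt + p_max_nt), 1.0)

    calibrated = non_target * s[:, None]    # ratios among non-targets preserved
    calibrated[rows, labels] = 1.0 - calibrated.sum(axis=-1)
    return calibrated

def kd_loss(student_logits, teacher_probs):
    # Standard KD objective: mean KL(teacher || student) over the batch.
    z = np.asarray(student_logits, dtype=np.float64)
    z = z - z.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    kl = (teacher_probs * (np.log(teacher_probs) - log_p)).sum(axis=-1)
    return kl.mean()

# Toy batch: the teacher is right on sample 0 and wrong on sample 1.
teacher_logits = np.array([[2.0, 1.0, 0.1],
                           [0.5, 2.5, 0.3]])
labels = np.array([0, 0])
targets = calibrate_teacher_probs(teacher_logits, labels)
print(targets)
```

The sketch introduces no learnable parameters, which matches the abstract's claim. `alpha` is a fixed hyperparameter added for this example only.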