LG · CV · Dec 4, 2025

Rethinking Decoupled Knowledge Distillation: A Predictive Distribution Perspective

arXiv:2512.04625v1 · 7.11 citations · h-index: 1 · Has Code · IEEE Trans Neural Netw Learn Syst

Originality: Incremental advance

AI Analysis

This work addresses knowledge distillation for model compression, presenting an incremental improvement over existing methods.

The paper rethinks Decoupled Knowledge Distillation (DKD) from a predictive distribution perspective. It proposes Generalized Decoupled Knowledge Distillation (GDKD) together with a streamlined algorithm, and reports superior performance over DKD and other methods on benchmarks such as CIFAR-100 and ImageNet.

Over the history of knowledge distillation, the focus shifted from logit-based to feature-based approaches. This transition has been revisited with the advent of Decoupled Knowledge Distillation (DKD), which re-emphasizes the importance of logit knowledge through advanced decoupling and weighting strategies. While DKD marks a significant advancement, its underlying mechanisms merit deeper exploration. In response, we rethink DKD from a predictive distribution perspective. First, we introduce an enhanced version, the Generalized Decoupled Knowledge Distillation (GDKD) loss, which offers a more versatile method for decoupling logits. We then examine the teacher model's predictive distribution and its impact on the gradients of the GDKD loss, uncovering two critical insights that are often overlooked: (1) partitioning by the top logit considerably improves the interrelationship of non-top logits, and (2) amplifying the focus on the distillation loss of non-top logits enhances the knowledge extraction among them. Building on these insights, we further propose a streamlined GDKD algorithm with an efficient partition strategy to handle the multimodality of teacher models' predictive distributions. Comprehensive experiments on a variety of benchmarks, including CIFAR-100, ImageNet, Tiny-ImageNet, CUB-200-2011, and Cityscapes, demonstrate GDKD's superior performance over both the original DKD and other leading knowledge distillation methods. The code is available at https://github.com/ZaberKo/GDKD.
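To make the decoupling idea concrete, the following is a minimal sketch of a two-way split of the distillation loss around a single "top" class. It is not the authors' GDKD implementation (see the linked repository for that). It only illustrates the standard chain-rule decomposition of the KL divergence that DKD-style losses build on. The function names, the `w_top` and `w_rest` weights, and the choice of the teacher's argmax as the partition index are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def decoupled_kd(teacher_logits, student_logits, top_index,
                 w_top=1.0, w_rest=1.0, temperature=1.0):
    """Split the distillation loss into two independently weighted terms.

    top_index is the class used for the partition (hypothetical choice:
    the teacher's top logit). Returns (loss, binary_term, rest_term, p_top).
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    p_top, q_top = p[top_index], q[top_index]

    # Term 1: binary KL between "top class" and "everything else".
    binary = kl([p_top, 1.0 - p_top], [q_top, 1.0 - q_top])

    # Term 2: KL among the remaining classes, renormalized to sum to 1.
    p_rest = [pi / (1.0 - p_top) for i, pi in enumerate(p) if i != top_index]
    q_rest = [qi / (1.0 - q_top) for i, qi in enumerate(q) if i != top_index]
    rest = kl(p_rest, q_rest)

    # Plain KD implicitly weights `rest` by (1 - p_top); decoupling
    # replaces that coupling with free weights.
    return w_top * binary + w_rest * rest, binary, rest, p_top

teacher = [2.0, 1.0, 0.5, -1.0]
student = [1.0, 1.5, 0.0, -0.5]
top = max(range(len(teacher)), key=teacher.__getitem__)
loss, binary, rest, p_top = decoupled_kd(teacher, student, top, w_rest=2.0)
```

In this sketch, the ordinary KD loss `kl(softmax(teacher), softmax(student))` equals `binary + (1 - p_top) * rest`, so a confident teacher (large `p_top`) suppresses the non-top term. Raising `w_rest` is one simple way to amplify the non-top distillation loss, in the spirit of insight (2) above.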
