LG · IV · SP · Apr 27, 2025

Swapped Logit Distillation via Bi-level Teacher Alignment

arXiv:2504.20108v1 · 1 citation · h-index: 28 · Multimedia Systems
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in knowledge distillation for model compression, offering an incremental improvement over existing methods.

The paper tackles the problem of incorrect predictions in knowledge distillation by proposing Swapped Logit Distillation, which uses swapped logit processing and loss scheduling to align teacher and student outputs. The authors report state-of-the-art performance on image classification tasks.

Knowledge distillation (KD) compresses a network by transferring knowledge from a large (teacher) network to a smaller one (student). In mainstream approaches, the teacher transfers knowledge to the student directly from its original output distribution, which can propagate incorrect predictions. In this article, we propose a logit-based distillation method built on swapped logit processing, namely Swapped Logit Distillation (SLD). SLD rests on two assumptions: (1) a wrong prediction occurs when the confidence of the target label is not the maximum; (2) the "natural" limit of probability remains uncertain, as the best value to add to the target cannot be determined. To address these issues, we propose a swapped logit processing scheme. We find that the swap method can be effectively extended to both teacher and student outputs, which then act as two teachers. We further introduce loss scheduling to improve the alignment of the two teachers. Extensive experiments on image classification tasks demonstrate that SLD consistently outperforms previous state-of-the-art methods.
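The swap idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "swapping" means exchanging the target-class logit with the largest logit whenever they differ, and it uses the conventional temperature-scaled KL loss (with the usual T² factor) as the distillation objective. All function names and the default temperature are hypothetical, and the second (student-derived) teacher and the loss scheduling are not covered here.

```python
import math

def swap_target_with_max(logits, target):
    """Return a copy of `logits` in which the target-class logit and the
    largest logit trade places (assumed reading of the swap step)."""
    swapped = list(logits)
    top = max(range(len(swapped)), key=swapped.__getitem__)
    if top != target:
        swapped[top], swapped[target] = swapped[target], swapped[top]
    return swapped

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def swapped_kd_loss(teacher_logits, student_logits, target, temperature=4.0):
    """Distillation loss against the swapped teacher distribution.
    The temperature**2 scaling is the common KD convention, assumed here."""
    t = softmax(swap_target_with_max(teacher_logits, target), temperature)
    s = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(t, s)

# Teacher wrongly prefers class 1 although the label is class 0.
teacher = [2.0, 5.0, 1.0]
student = [1.5, 1.0, 0.5]
corrected = swap_target_with_max(teacher, target=0)
loss = swapped_kd_loss(teacher, student, target=0)
```

After the swap, the target class holds the teacher's highest confidence while the remaining logit values are kept, so the student is trained toward a distribution whose top class matches the label.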

Code Implementations: 1 repo