CV | Sep 26, 2024

Enhancing Logits Distillation with Plug&Play Kendall's $\tau$ Ranking Loss

arXiv:2409.17823v2 | 1 citation | h-index: 6 | Has Code
AI Analysis

This work addresses a bottleneck in knowledge distillation for machine learning practitioners, offering an incremental improvement to enhance model compression and transfer learning.

The paper addresses sub-optimal knowledge distillation caused by gradient imbalance under KL divergence. It proposes a plug-and-play ranking loss based on Kendall's τ that consistently improves performance across multiple datasets and architectures, with accuracy gains of up to 2.1% on CIFAR-100.

Knowledge distillation typically minimizes the Kullback-Leibler (KL) divergence between teacher and student logits. However, optimizing the KL divergence can be challenging for the student and often leads to sub-optimal solutions. We further show that gradients induced by KL divergence scale with the magnitude of the teacher logits, thereby diminishing updates on low-probability channels. This imbalance weakens the transfer of inter-class information and in turn limits the performance improvements achievable by the student. To mitigate this issue, we propose a plug-and-play auxiliary ranking loss based on Kendall's $\tau$ coefficient that can be seamlessly integrated into any logit-based distillation framework. It supplies inter-class relational information while rebalancing gradients toward low-probability channels. We demonstrate that the proposed ranking loss is largely invariant to channel scaling and optimizes an objective aligned with that of KL divergence, making it a natural complement rather than a replacement. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that our plug-and-play ranking loss consistently boosts the performance of multiple distillation baselines. Code is available at https://github.com/OvernighTea/RankingLoss-KD.
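
To make the idea of a Kendall's $\tau$ ranking loss concrete, here is a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). It assumes a common smooth surrogate in which `tanh(k * diff)` stands in for `sign(diff)`, and the names `kendall_tau`, `soft_kendall_tau`, `ranking_loss`, and the steepness parameter `k` are hypothetical, chosen only for illustration.

```python
import numpy as np


def _pairwise_diffs(x):
    """Return x[i] - x[j] for every unordered pair i < j of a 1-D vector."""
    x = np.asarray(x, dtype=float)
    diffs = x[:, None] - x[None, :]
    upper = np.triu_indices(x.size, k=1)  # each pair counted once
    return diffs[upper]


def kendall_tau(teacher, student):
    """Hard Kendall's tau (no tie correction): mean agreement of pair orderings."""
    return float(np.mean(np.sign(_pairwise_diffs(teacher))
                         * np.sign(_pairwise_diffs(student))))


def soft_kendall_tau(teacher, student, k=1.0):
    """Smooth surrogate (an assumption, not the paper's exact form):
    tanh(k * diff) replaces sign(diff) so the value varies smoothly with the logits."""
    return float(np.mean(np.tanh(k * _pairwise_diffs(teacher))
                         * np.tanh(k * _pairwise_diffs(student))))


def ranking_loss(teacher, student, k=1.0):
    """Auxiliary loss: small when the student orders the channels like the teacher."""
    return 1.0 - soft_kendall_tau(teacher, student, k)


if __name__ == "__main__":
    teacher = [3.0, 1.0, 2.0]
    agree = [0.3, 0.1, 0.2]      # same ordering, much smaller magnitudes
    reversed_ = [1.0, 3.0, 2.0]  # opposite ordering

    print(kendall_tau(teacher, agree))      # ordering agrees on every pair
    print(kendall_tau(teacher, reversed_))  # ordering disagrees on every pair
    print(ranking_loss(teacher, agree), ranking_loss(teacher, reversed_))
```

The hard coefficient depends only on the order of the channels, so rescaling the student logits by a positive constant leaves it unchanged. That is one way to read the abstract's claim that the ranking loss is largely invariant to channel scaling. The `tanh` surrogate is only approximately scale-invariant. In a distillation setup, a term like `ranking_loss` would be added to the existing KL objective with a weighting coefficient rather than replacing it.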

