CV | AI | LG | Nov 18, 2025

Logit-Based Losses Limit the Effectiveness of Feature Knowledge Distillation

arXiv:2511.14981v1 | Has Code
Originality: Highly original
AI Analysis

This work addresses how logit-based losses limit the effectiveness of feature knowledge distillation, an issue relevant to machine learning practitioners. The proposed method is incremental but shows strong gains.

The paper proposes a feature knowledge distillation framework that trains the student's backbone with feature-based losses only, excluding logit-based losses such as cross entropy, and introduces a knowledge quality metric for selecting the most effective teacher layers. It reports state-of-the-art performance, with top-1 accuracy gains of up to 15% over standard approaches on three image classification datasets.

Knowledge distillation (KD) methods can transfer knowledge of a parameter-heavy teacher model to a light-weight student model. The status quo for feature KD methods is to utilize loss functions based on logits (i.e., pre-softmax class scores) and intermediate layer features (i.e., latent representations). Unlike previous approaches, we propose a feature KD framework for training the student's backbone using feature-based losses exclusively (i.e., without logit-based losses such as cross entropy). Leveraging recent discoveries about the geometry of latent representations, we introduce a knowledge quality metric for identifying which teacher layers provide the most effective knowledge for distillation. Experiments on three image classification datasets with four diverse student-teacher pairs, spanning convolutional neural networks and vision transformers, demonstrate that our KD method achieves state-of-the-art performance, delivering top-1 accuracy boosts of up to 15% over standard approaches. We publicly share our code to facilitate future work at https://github.com/Thegolfingocto/KD_wo_CE.
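To make the distinction in the abstract concrete, here is a minimal sketch contrasting a status-quo objective (logit-based plus feature-based terms) with a feature-only objective for the student's backbone. It is not the authors' implementation: the function names, the weights `alpha` and `beta`, the temperature, and the use of mean squared error as the feature loss are all assumptions chosen for illustration, and the sketch assumes student and teacher features already share a dimension.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Logit-based loss: negative log-probability of the true class."""
    return -math.log(softmax(student_logits)[label])

def logit_kd(student_logits, teacher_logits, temperature=4.0):
    """Logit-based loss: KL(teacher || student) on temperature-softened outputs."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

def feature_mse(student_feat, teacher_feat):
    """Feature-based loss (assumed here to be MSE between latent vectors)."""
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)

def status_quo_loss(student_logits, teacher_logits, label,
                    student_feat, teacher_feat, alpha=1.0, beta=1.0):
    """Typical feature KD objective: logit-based terms plus a feature term."""
    return (cross_entropy(student_logits, label)
            + alpha * logit_kd(student_logits, teacher_logits)
            + beta * feature_mse(student_feat, teacher_feat))

def feature_only_loss(student_feat, teacher_feat):
    """Backbone objective with no logit-based terms (hypothetical form)."""
    return feature_mse(student_feat, teacher_feat)

# Toy latent vectors: only the second coordinate differs, by 2.
print(feature_only_loss([1.0, 2.0], [1.0, 4.0]))  # 2.0
```

In the feature-only variant the class labels and teacher logits never enter the backbone's objective. Which teacher layer supplies `teacher_feat` is the question the paper's knowledge quality metric is meant to answer; that metric is not reproduced here.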

Code Implementations: 1 repo
