CL | Feb 1

Rethinking Selective Knowledge Distillation

arXiv:2602.01395v1
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in knowledge distillation for LLMs, offering practical gains for model deployment, though it is incremental in refining existing selective distillation approaches.

The paper analyzes selective knowledge distillation in large language models along the position, class, and sample axes and introduces student-entropy-guided position selection (SE-KD). Extending SE-KD across the class and sample axes (SE-KD 3X) reduces wall time by 70% and peak memory by 18%, and cuts storage usage by 80% over prior methods, without sacrificing performance.

Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
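To make the idea of position-level selection concrete, here is a minimal sketch of student-entropy-guided selection in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the selection policy, so the choice to keep the highest-entropy positions, the `keep_fraction` parameter, the forward KL direction, and all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def entropy(log_probs):
    """Shannon entropy (in nats) of a distribution given its log-probabilities."""
    return -sum(math.exp(lp) * lp for lp in log_probs)


def select_positions(student_logits, keep_fraction):
    """Assumed policy: keep the positions where the student is most uncertain."""
    entropies = [entropy(log_softmax(row)) for row in student_logits]
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def kl_teacher_student(teacher_row, student_row):
    """Forward KL(teacher || student) at one position (assumed loss direction)."""
    lt = log_softmax(teacher_row)
    ls = log_softmax(student_row)
    return sum(math.exp(t) * (t - s) for t, s in zip(lt, ls))


def selective_kd_loss(student_logits, teacher_logits, keep_fraction=0.5):
    """Average the distillation loss over the selected positions only."""
    positions = select_positions(student_logits, keep_fraction)
    losses = [kl_teacher_student(teacher_logits[i], student_logits[i]) for i in positions]
    return sum(losses) / len(losses), positions


# Toy sequence: 4 token positions, vocabulary of 3 classes.
student = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
teacher = [[4.0, 1.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.5, 0.0], [1.0, 0.0, 0.0]]
loss, kept = selective_kd_loss(student, teacher, keep_fraction=0.5)
```

In this toy example, teacher distributions are only needed at the kept positions, which is the kind of saving that could make offline teacher caching cheaper. A real training loop would compute the same quantities on batched tensors rather than Python lists.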

