CV
Nov 12, 2025

Enriching Knowledge Distillation with Cross-Modal Teacher Fusion

arXiv:2511.09286v1
1 citation
h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses limited knowledge diversity in knowledge distillation, a practical bottleneck for machine learning practitioners, and improves student accuracy and reliability. The advance is incremental, as it builds on existing multi-teacher methods.

The paper tackles limited knowledge diversity in multi-teacher knowledge distillation by adding CLIP's vision-language knowledge as complementary supervision, which improves accuracy, robustness, and prediction confidence across multiple benchmarks.

Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
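
The logit side of the fusion described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: CLIP logits from several prompts are simply averaged, the fused target is a convex combination controlled by a hypothetical weight `alpha`, and the student is trained with the standard temperature-scaled KL distillation loss. The function names are illustrative and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_prompt_logits(per_prompt_logits):
    """Average CLIP logits over prompts (assumed aggregation rule)."""
    count = len(per_prompt_logits)
    return [sum(column) / count for column in zip(*per_prompt_logits)]


def fuse_logits(teacher_logits, clip_logits, alpha=0.5):
    """Convex combination of teacher and CLIP logits.

    alpha is a hypothetical fusion weight, not a value from the paper.
    """
    return [alpha * t + (1.0 - alpha) * c
            for t, c in zip(teacher_logits, clip_logits)]


def kd_loss(student_logits, target_logits, temperature=4.0):
    """Temperature-scaled KL(target || student), as in standard logit KD."""
    p = softmax(target_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy three-class example with made-up numbers.
teacher = [4.0, 1.0, 0.5]
clip_per_prompt = [[3.0, 2.0, 0.0], [3.5, 1.5, 0.5]]
clip = average_prompt_logits(clip_per_prompt)
fused = fuse_logits(teacher, clip, alpha=0.5)
student = [2.0, 1.0, 1.0]
loss = kd_loss(student, fused)
```

In this sketch the fused target keeps the teacher's top class while the CLIP term reshapes the non-target logits, which is the effect the abstract attributes to fusion. Feature-level fusion is omitted here.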
