CV | Nov 10, 2025

Distillation Dynamics: Towards Understanding Feature-Based Distillation in Vision Transformers

arXiv:2511.06848v2 | 1 citation | h-index: 4 | Has Code
Originality: Highly original
AI Analysis

The paper addresses a problem that matters to researchers and practitioners in model compression, and it offers theoretical insights for improving ViT distillation methods.

The paper investigates why feature-based knowledge distillation fails for Vision Transformers (ViTs). It finds that ViTs follow a U-shaped information processing pattern and identifies a representational mismatch between teacher and student as the root cause of the failure, showing that naive feature alignment harms student performance.
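To make the two objectives being compared concrete, the following is a minimal sketch of a logit-based distillation loss next to a naive feature-alignment loss. It is not taken from the paper's repository. The function names, the linear `projector` matrix, and the default temperature are all illustrative assumptions. The projector is there only because a smaller student typically has fewer channels than the teacher.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def feature_mse_loss(student_tokens, teacher_tokens, projector):
    """MSE between teacher features and linearly projected student features.

    projector is a hypothetical d_teacher x d_student weight matrix
    (a list of rows) that lifts student channels to the teacher width.
    """
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_tokens, teacher_tokens):
        projected = [sum(w * x for w, x in zip(row, s_tok)) for row in projector]
        for p_val, t_val in zip(projected, t_tok):
            total += (p_val - t_val) ** 2
            count += 1
    return total / count
```

In this sketch the feature loss forces the student, through a low-rank projection, to reproduce every teacher channel. That is the kind of naive mimicry the paper argues can hurt when applied to late layers.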

While feature-based knowledge distillation has proven highly effective for compressing CNNs, these techniques unexpectedly fail when applied to Vision Transformers (ViTs), often performing worse than simple logit-based distillation. We provide the first comprehensive analysis of this phenomenon through a novel analytical framework termed "distillation dynamics", combining frequency spectrum analysis, information entropy metrics, and activation magnitude tracking. Our investigation reveals that ViTs exhibit a distinctive U-shaped information processing pattern: initial compression followed by expansion. We identify the root cause of negative transfer in feature distillation: a fundamental representational paradigm mismatch between teacher and student models. Through frequency-domain analysis, we show that teacher models employ distributed, high-dimensional encoding strategies in later layers that smaller student models cannot replicate due to limited channel capacity. This mismatch causes late-layer feature alignment to actively harm student performance. Our findings reveal that successful knowledge transfer in ViTs requires moving beyond naive feature mimicry to methods that respect these fundamental representational constraints, providing essential theoretical guidance for designing effective ViT compression strategies. All source code and experimental logs are provided at https://github.com/thy960112/Distillation-Dynamics.

