LG · AI · CL · Feb 6, 2025

Revisiting Intermediate-Layer Matching in Knowledge Distillation: Layer-Selection Strategy Doesn't Matter (Much)

arXiv:2502.04499v1 · 7 citations · h-index: 4 · IJCNLP-AACL
AI Analysis

This work addresses a practical question for researchers and practitioners in model compression: it suggests that the choice of layer-selection strategy may be less critical than previously thought. The contribution is incremental, refining existing distillation practice.

The paper investigates the impact of layer-selection strategies in intermediate-layer knowledge distillation. It finds that even seemingly nonsensical strategies, such as matching the teacher's layers in reverse order, yield surprisingly good student performance, with minimal differences across strategies.

Knowledge distillation (KD) is a popular method of transferring knowledge from a large "teacher" model to a small "student" model. KD can be divided into two categories: prediction matching and intermediate-layer matching. We explore an intriguing phenomenon: layer-selection strategy does not matter (much) in intermediate-layer matching. In this paper, we show that seemingly nonsensical matching strategies such as matching the teacher's layers in reverse still result in surprisingly good student performance. We provide an interpretation for this phenomenon by examining the angles between teacher layers viewed from the student's perspective.
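To make the idea of a layer-selection strategy concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes an evenly spaced "forward" mapping, a mean-squared-error matching loss, and equal hidden sizes for teacher and student (real setups often add a learned projection). The function names `forward_mapping`, `reverse_mapping`, `matching_loss`, and `angle_from_student` are hypothetical, and the angle helper is only one plausible reading of "angles between teacher layers viewed from the student's perspective".

```python
import math


def forward_mapping(num_student, num_teacher):
    """Evenly spaced mapping (an assumption): student layer i -> teacher layer i*T//S (1-indexed)."""
    return [i * num_teacher // num_student for i in range(1, num_student + 1)]


def reverse_mapping(num_student, num_teacher):
    """Same teacher layers as the forward mapping, but assigned in reverse order."""
    return list(reversed(forward_mapping(num_student, num_teacher)))


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def matching_loss(student_states, teacher_states, mapping):
    """Average MSE between each student layer and its mapped teacher layer.

    mapping[s] is the 1-indexed teacher layer matched to student layer s (0-indexed).
    """
    losses = [mse(student_states[s], teacher_states[t - 1])
              for s, t in enumerate(mapping)]
    return sum(losses) / len(losses)


def angle_from_student(student_vec, teacher_a, teacher_b):
    """Angle in degrees between two teacher vectors, measured from the student vector.

    Assumption: the angle is taken between (teacher_a - student) and (teacher_b - student).
    """
    da = [a - s for a, s in zip(teacher_a, student_vec)]
    db = [b - s for b, s in zip(teacher_b, student_vec)]
    dot = sum(x * y for x, y in zip(da, db))
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


# Toy example: a 12-layer teacher whose layer l has hidden state [l, l],
# and a 3-layer student that currently equals teacher layers 4, 8, and 12.
teacher = [[float(l), float(l)] for l in range(1, 13)]
student = [[4.0, 4.0], [8.0, 8.0], [12.0, 12.0]]

fwd = forward_mapping(3, 12)
rev = reverse_mapping(3, 12)
loss_fwd = matching_loss(student, teacher, fwd)
loss_rev = matching_loss(student, teacher, rev)
```

In this toy setup the two strategies select the same teacher layers and differ only in which student layer each one supervises. In a training loop, either loss would be added to the prediction-matching objective.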

