LG
Jan 12, 2024

An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation

IBM
arXiv:2401.06356v2
h-index: 27
Originality: Synthesis-oriented
AI Analysis

This work addresses the lack of a systematic understanding of how parameter choices affect knowledge distillation, a gap that matters to NLP practitioners. It is incremental in that it builds on existing empirical efforts.

The paper reports a large-scale empirical study of how configuration parameter choices affect performance in knowledge distillation, covering 13 datasets from 4 NLP tasks and 3 student sizes. It finds that sub-optimal choices can significantly hurt student performance, and it identifies a single configuration that performs well overall.

We present a large-scale empirical study of how choices of configuration parameters affect performance in knowledge distillation (KD). An example of such a KD parameter is the measure of distance between the predictions of the teacher and the student, common choices for which include the mean squared error (MSE) and the KL-divergence. Although scattered efforts have been made to understand the differences between such options, the KD literature still lacks a systematic study on their general effect on student performance. We take an empirical approach to this question in this paper, seeking to find out the extent to which such choices influence student performance across 13 datasets from 4 NLP tasks and 3 student sizes. We quantify the cost of making sub-optimal choices and identify a single configuration that performs well across the board.
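
To make the distance-measure parameter mentioned in the abstract concrete, here is a minimal sketch of the two options it names, MSE and KL-divergence, computed between teacher and student predictions for a single example. This is not the paper's implementation: the function names, the choice to apply MSE to raw logits, the `temperature` argument, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities (temperature is an assumed, illustrative knob)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mse_distance(teacher_logits, student_logits):
    """Mean squared error between teacher and student logits (assumed to act on logits)."""
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n


def kl_distance(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two softmax distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one 3-class example.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]

print("MSE:", mse_distance(teacher, student))
print("KL :", kl_distance(teacher, student))
```

Both functions return zero when the student matches the teacher exactly, but they penalize disagreement differently: MSE here compares raw scores directly, while KL compares normalized distributions and weights each class by the teacher's probability.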
