CL | AI | LG | May 24, 2023

How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

arXiv:2305.15032v1226 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient model compression in NLP. Its contribution is incremental: it builds on existing distillation methods and adds a comprehensive empirical comparison of them.

This paper tackles the problem of compressing BERT models via knowledge distillation by evaluating intermediate layer distillation objectives in both task-specific and task-agnostic settings. It finds that attention transfer performs best overall and that the choice of teacher layers used to initialise the student significantly affects task-specific distillation: for vanilla KD and hidden states transfer, initialising from lower teacher layers rather than higher ones improves accuracy on QNLI by up to 17.8 percentage points.

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.
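The objectives and initialisation choices named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact loss formulations, so everything below is an assumption made for illustration: vanilla KD is written as a temperature-softened KL divergence on logits, hidden states transfer as a mean squared error between same-sized vectors, and attention transfer as a KL divergence between attention distributions. All function names, including `init_layer_indices`, are hypothetical and are not taken from the released framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def vanilla_kd_loss(teacher_logits, student_logits, temperature=2.0):
    """Assumed form of vanilla KD: KL between softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return kl_divergence(p, q)

def hidden_state_loss(teacher_hidden, student_hidden):
    """Assumed form of hidden states transfer: MSE between two vectors
    of the same size (no projection layer is modelled here)."""
    n = len(teacher_hidden)
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n

def attention_transfer_loss(teacher_attn, student_attn):
    """Assumed form of attention transfer: mean KL over query positions.
    Each argument is a list of rows; each row is a distribution over keys."""
    rows = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)

def init_layer_indices(num_teacher_layers, num_student_layers, strategy="lower"):
    """Hypothetical helper: which teacher layers (0-indexed) seed the student."""
    if strategy == "lower":
        return list(range(num_student_layers))
    if strategy == "higher":
        start = num_teacher_layers - num_student_layers
        return list(range(start, num_teacher_layers))
    raise ValueError(f"unknown strategy: {strategy}")

# Example: a 6-layer student initialised from a 12-layer teacher.
lower = init_layer_indices(12, 6, "lower")    # teacher layers 0..5
higher = init_layer_indices(12, 6, "higher")  # teacher layers 6..11
```

In this sketch, the "lower" versus "higher" strategies correspond to the two initialisation settings the abstract contrasts. A real implementation would copy the selected teacher layers' weights into the student before distillation starts.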

Code Implementations: 1 repo
