CL · Feb 1, 2023

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

arXiv:2302.00444v1293 citations · h-index: 32
Originality: Incremental advance
AI Analysis

This work helps NLP practitioners get more out of knowledge distillation, but it is an incremental advance that builds on existing distillation methods.

The paper tackles the problem of inefficient knowledge transfer in distillation for pre-trained language models by proposing an actor-critic approach to select appropriate knowledge at different training steps, resulting in significant performance improvements on GLUE datasets.

Knowledge distillation addresses the problem of transferring knowledge from a teacher model to a student model. In this process, we typically have multiple types of knowledge extracted from the teacher model. The problem is to make full use of them to train the student model. Our preliminary study shows that: (1) not all of the knowledge is necessary for learning a good student model, and (2) knowledge distillation can benefit from certain knowledge at different training steps. In response to these, we propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation. In addition, we offer a refinement of the training algorithm to ease the computational burden. Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.
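The selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the three knowledge types ("logits", "hidden", "attention"), the loss choices, the fixed keep-probabilities, and all function names are assumptions made for illustration. The sketch covers only the actor-style sampling step, in which each knowledge type is kept or dropped for one training step; the critic and the policy update are omitted because this page does not describe them.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def knowledge_losses(teacher, student, temperature=2.0):
    """One distillation loss per knowledge type (the types are hypothetical)."""
    return {
        "logits": kl_divergence(
            softmax(teacher["logits"], temperature),
            softmax(student["logits"], temperature),
        ),
        "hidden": mse(teacher["hidden"], student["hidden"]),
        "attention": mse(teacher["attention"], student["attention"]),
    }


def sample_selection(keep_probs, rng):
    """Actor-style step: draw a keep (1) or drop (0) decision per knowledge type."""
    return {name: 1 if rng.random() < p else 0 for name, p in keep_probs.items()}


def selected_loss(losses, selection):
    """Sum only the losses whose knowledge type was selected at this step."""
    return sum(losses[name] * selection[name] for name in losses)


# Toy teacher and student outputs for a single training example.
teacher = {"logits": [2.0, 0.5, -1.0], "hidden": [0.3, -0.2], "attention": [0.6, 0.4]}
student = {"logits": [1.0, 0.8, -0.5], "hidden": [0.1, 0.0], "attention": [0.5, 0.5]}

rng = random.Random(0)  # seeded so the sampled selection is reproducible
losses = knowledge_losses(teacher, student)
# Assumed fixed policy; a learned actor would update these probabilities.
selection = sample_selection({"logits": 0.9, "hidden": 0.5, "attention": 0.5}, rng)
total = selected_loss(losses, selection)
```

In a full actor-critic setup, the keep-probabilities would come from a learned policy and would change as training progresses, which is how the selected knowledge could differ from one training step to the next.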

