CL | LG | May 16, 2023

Weight-Inherited Distillation for Task-Agnostic BERT Compression

arXiv:2305.09098v2 | 34 citations | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses efficient model deployment for NLP practitioners by offering a more direct compression method. The contribution is incremental, since it builds on existing knowledge distillation techniques.

The paper tackles BERT compression by proposing Weight-Inherited Distillation (WID), a method that transfers knowledge from a teacher model to a student directly, by inheriting the teacher's weights rather than adding extra alignment losses. On the GLUE and SQuAD benchmarks, WID outperforms previous state-of-the-art KD-based baselines.

Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses for the student model to mimic the behavior of the teacher model. These methods transfer the knowledge in an indirect way. In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher. WID does not require any additional alignment loss and trains a compact student by inheriting the weights, showing a new perspective of knowledge distillation. Specifically, we design the row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the attention patterns from the teacher model without any alignment loss on attention distributions. The code is available at https://github.com/wutaiqiang/WID-NAACL2024.
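
To make the compactor idea in the abstract more concrete, here is a minimal sketch of how a row compactor and a column compactor could be folded into a single smaller weight matrix. It rests on assumptions that the abstract does not state: the compactors are treated as plain linear maps applied on the left and right of a teacher weight W, giving W_s = R W C, and all names (`matmul`, `compress_weight`, `R`, `C`) are hypothetical. The values are hand-picked for illustration; how WID actually trains the compactors is not shown, and the sketch uses plain Python lists where a real implementation would use a tensor library.

```python
# Illustrative sketch only: folding assumed row/column compactors into one
# compressed weight. This is not the authors' implementation.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def compress_weight(row_compactor, weight, col_compactor):
    """Merge both compactors into the weight: W_s = R @ W @ C."""
    return matmul(matmul(row_compactor, weight), col_compactor)

# Teacher weight W: 4 outputs x 3 inputs.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0],
     [1.0, 0.0, 1.0]]

# Assumed row compactor R (2 x 4): maps 4 teacher outputs to 2 student outputs.
R = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5]]

# Assumed column compactor C (3 x 2): maps 2 student inputs to 3 teacher inputs.
C = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

# Compressed student weight: 2 outputs x 2 inputs.
W_s = compress_weight(R, W, C)

# A student-sized input as a column vector.
x = [[1.0], [2.0]]

# Applying the three maps one after another ...
stepwise = matmul(R, matmul(W, matmul(C, x)))
# ... gives the same result as applying the single merged weight.
merged = matmul(W_s, x)
```

Because every map in the sketch is linear, the three-step computation and the merged weight agree on any input, which is why the compactors can be discarded after merging and only the smaller matrix needs to be kept.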

Code Implementations: 1 repo
