CL · LG · Jul 3, 2024

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

arXiv:2407.02775v1 · 2 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses model compression for pre-trained language models, offering incremental improvements in efficiency and performance for NLP applications.

The paper improves knowledge distillation for BERT by exploiting relation-level knowledge and allowing a flexible number of student attention heads. It reports outperforming state-of-the-art distillation methods on GLUE and extractive QA tasks, with a substantial decrease in inference time and little performance drop.

Knowledge distillation is an effective technique for pre-trained language model compression. Although existing knowledge distillation methods perform well for the most typical model, BERT, they could be further improved in two aspects: relation-level knowledge could be further explored to improve model performance, and the setting of the student attention head number could be more flexible to decrease inference time. We therefore propose MLKD-BERT, a novel knowledge distillation method that distills multi-level knowledge in a teacher-student framework. Extensive experiments on the GLUE benchmark and extractive question answering tasks demonstrate that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set the student attention head number, allowing for a substantial decrease in inference time with little performance drop.
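
For readers new to the teacher-student framework mentioned in the abstract, the following is a minimal sketch of the generic soft-label distillation objective: the KL divergence between temperature-softened teacher and student output distributions. It is not MLKD-BERT's multi-level loss, which the abstract does not spell out. The function names, the temperature value, and the example logits are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-label distillation, shown only as an illustration;
    this is an assumed baseline, not the loss used by MLKD-BERT.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class task.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
print(distillation_loss(teacher, student))  # positive: student disagrees
print(distillation_loss(teacher, teacher))  # zero: distributions match
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice this term is usually combined with a standard supervised loss on the ground-truth labels.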

