CL · Apr 7, 2020

Towards Non-task-specific Distillation of BERT via Sentence Representation Approximation

arXiv:2004.03097v1991 citations
AI Analysis

This work addresses the deployment challenges that BERT poses for NLP practitioners by enabling more efficient models that retain general semantic knowledge, though the contribution is incremental because it builds on existing distillation techniques.

The paper tackles BERT's large size and computational cost by proposing a non-task-specific distillation method that transfers BERT's knowledge into a smaller LSTM-based model. The distilled model is more efficient and outperforms task-specific distillation methods, as well as larger models such as ELMo, on GLUE benchmark tasks.

Recently, BERT has become an essential ingredient of various NLP deep models due to its effectiveness and universal usability. However, the online deployment of BERT is often blocked by its large number of parameters and high computational cost. Many studies show that knowledge distillation is efficient at transferring knowledge from BERT into a model with fewer parameters. Nevertheless, current BERT distillation approaches mainly focus on task-specific distillation, and such methods lose the general semantic knowledge that makes BERT universally usable. In this paper, we propose a distillation framework oriented towards sentence representation approximation that can distill the pre-trained BERT into a simple LSTM-based model without specifying tasks. Consistent with BERT, our distilled model is able to perform transfer learning via fine-tuning to adapt to any sentence-level downstream task. Besides, our model can be further combined with task-specific distillation procedures. The experimental results on multiple NLP tasks from the GLUE benchmark show that our approach outperforms other task-specific distillation methods and even much larger models such as ELMo, with substantially improved efficiency.
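
The abstract does not spell out the training objective, so the following is only a rough sketch of what "sentence representation approximation" could look like in its simplest form. It assumes the student is trained to minimize mean squared error against frozen teacher sentence vectors. Everything here is hypothetical: a fixed random linear map stands in for BERT, a small linear layer stands in for the LSTM student, and the names `mse`, `student_forward`, `distill_step`, and `run_demo` are illustrative, not taken from the paper.

```python
import random


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def student_forward(weights, features):
    """Linear map from sentence features to a sentence vector.

    Assumption: a linear layer stands in for the LSTM student.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def distill_step(weights, features, teacher_vec, lr):
    """One SGD step pulling the student vector towards the teacher vector.

    Updates `weights` in place and returns the loss before the update.
    """
    pred = student_forward(weights, features)
    n = len(pred)
    for i, row in enumerate(weights):
        # d(MSE)/d(pred[i]) = 2 * (pred[i] - teacher[i]) / n
        grad_out = 2.0 * (pred[i] - teacher_vec[i]) / n
        for j in range(len(row)):
            row[j] -= lr * grad_out * features[j]
    return mse(pred, teacher_vec)


def run_demo(seed=0, epochs=200, lr=0.1):
    """Train the toy student on 8 synthetic 'sentences'; return per-epoch loss."""
    rng = random.Random(seed)
    in_dim, out_dim = 4, 3
    # Assumption: a frozen random linear map plays the role of the teacher
    # (BERT); no task labels are used, only the teacher's sentence vectors.
    teacher_w = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    sentences = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(8)]
    targets = [student_forward(teacher_w, s) for s in sentences]

    student_w = [[0.0] * in_dim for _ in range(out_dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for s, t in zip(sentences, targets):
            total += distill_step(student_w, s, t, lr)
        losses.append(total / len(sentences))
    return losses


if __name__ == "__main__":
    history = run_demo()
    print("first epoch loss:", history[0])
    print("last epoch loss:", history[-1])
```

Because the targets come only from the teacher's sentence vectors and not from task labels, a student trained this way could in principle be fine-tuned afterwards on any sentence-level task, which is the property the abstract emphasizes. The paper's actual student architecture and loss may differ from this sketch.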
