SKDBERT: Compressing BERT via Stochastic Knowledge Distillation
This work addresses the need for efficient, compact language models in NLP applications and represents an incremental improvement in knowledge distillation techniques.
The paper tackles the problem of compressing BERT models by proposing Stochastic Knowledge Distillation (SKD) to create SKDBERT, which reduces the size of BERT_BASE by 40% while retaining 99.5% of its performance on GLUE and running 100% faster at inference.
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model, dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, and transfers its knowledge to the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions that assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it preserves the diversity of the multi-level teacher models by stochastically sampling a single teacher model in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
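The per-iteration teacher sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy teachers and student are plain functions returning fixed logits, the names `kd_loss`, `sample_teacher`, and `skd_losses` are hypothetical, the loss is a standard temperature-softened KL divergence chosen for illustration, and the sampling probabilities are placeholders, since the abstract does not specify the three distributions. No parameter update is performed; the sketch only shows one teacher being drawn per iteration and matched one-to-one against the student.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions (illustrative choice)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sample_teacher(teachers, probs, rng):
    """Draw a single teacher from the ensemble according to `probs`."""
    return rng.choices(teachers, weights=probs, k=1)[0]


def skd_losses(student, teachers, probs, batches, seed=0):
    """For each batch, sample one teacher and compute a one-to-one KD loss."""
    rng = random.Random(seed)
    losses = []
    for batch in batches:
        teacher = sample_teacher(teachers, probs, rng)
        losses.append(kd_loss(student(batch), teacher(batch)))
    return losses


if __name__ == "__main__":
    # Toy stand-ins for teachers of increasing capacity (assumption:
    # larger teachers produce sharper logits on this fake 3-class task).
    teachers = [
        lambda batch: [1.0, 0.5, 0.0],
        lambda batch: [2.0, 0.5, -1.0],
        lambda batch: [4.0, 0.0, -3.0],
    ]
    student = lambda batch: [0.5, 0.3, 0.1]

    # Placeholder sampling distribution; uniform is the simplest option.
    uniform = [1 / 3, 1 / 3, 1 / 3]
    for step, loss in enumerate(skd_losses(student, teachers, uniform, range(5))):
        print(f"step {step}: kd_loss = {loss:.4f}")
```

In a real training loop the sampled teacher's loss would drive a gradient step on the student; swapping `uniform` for a non-uniform weight vector is where a different sampling distribution would plug in.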