CL · LG · Dec 5, 2021

Causal Distillation for Language Models

arXiv:2112.02505v2639 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of creating more compact and efficient language models for NLP applications, representing an incremental improvement over standard distillation methods.

The paper improves knowledge distillation for language models by adding an objective that encourages the student model to imitate the causal computation process of the teacher model. Compared with standard distillation, this yields lower perplexity on Wikipedia and marked improvements on the GLUE, SQuAD, and CoNLL-2003 benchmarks.

Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model, that is, a simpler model with the same causal structure. IIT is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
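The interchange intervention idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two-layer `TinyModel`, the helper names (`interchange`, `iit_loss`), the choice of cross-entropy between counterfactual output distributions, and the hand-picked alignment between teacher and student hidden units are all assumptions made for illustration. The sketch runs each model on a base input, overwrites selected hidden units with the values they take on a source input, and penalizes the student when its counterfactual output differs from the teacher's.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vec):
    # Matrix-vector product with the matrix given as a list of rows.
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class TinyModel:
    """Hypothetical toy network: hidden = tanh(W1 x), output = softmax(W2 hidden)."""

    def __init__(self, w1, w2):
        self.w1, self.w2 = w1, w2

    def hidden(self, x):
        return [math.tanh(h) for h in linear(self.w1, x)]

    def output(self, hidden):
        return softmax(linear(self.w2, hidden))


def interchange(model, base, source, units):
    # Run the model on `base`, but overwrite the hidden `units`
    # with the values those units take when the model reads `source`.
    h_base = model.hidden(base)
    h_source = model.hidden(source)
    for u in units:
        h_base[u] = h_source[u]
    return model.output(h_base)


def cross_entropy(target_probs, pred_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, pred_probs))


def iit_loss(teacher, student, base, source, teacher_units, student_units):
    # Assumed form of the third objective: the student's counterfactual
    # output should match the teacher's under aligned interventions.
    teacher_cf = interchange(teacher, base, source, teacher_units)
    student_cf = interchange(student, base, source, student_units)
    return cross_entropy(teacher_cf, student_cf)


if __name__ == "__main__":
    # Teacher has 4 hidden units, student has 2 (all weights are made up).
    teacher = TinyModel(
        [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.3, 0.3]],
        [[0.2, -0.1, 0.4, 0.0], [-0.3, 0.5, 0.1, 0.2]],
    )
    student = TinyModel(
        [[0.4, -0.1], [0.2, 0.2]],
        [[0.3, -0.2], [-0.1, 0.4]],
    )
    # Assumed alignment: teacher units 0 and 1 correspond to student unit 0.
    print(iit_loss(teacher, student, [1.0, 0.0], [0.0, 1.0], [0, 1], [0]))
```

In a real training setup, a loss of this kind would be added to the task-specific and imitation objectives and minimized with respect to the student's parameters using an autodiff framework. The pure-Python version here only computes the value.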

Code Implementations: 1 repo
