CL · AI · Oct 24, 2020

Pre-trained Summarization Distillation

arXiv:2010.13002v2 · 123 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the practical need for efficient summarization models by evaluating distillation techniques. The contribution is incremental because it compares existing methods.

The paper compared three distillation methods for pre-trained summarization models, finding that shrink-and-fine-tune outperformed knowledge distillation and pseudo-labeling on CNN/DailyMail but underperformed pseudo-labeling on XSUM.

Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however, the NLP literature proposes many different distillation methods. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation. Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model. A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning. We compare these three approaches for distillation of Pegasus and BART, the current and former state-of-the-art pre-trained summarization models, and find that SFT outperforms knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but under-performs pseudo-labeling on the more abstractive XSUM dataset. PyTorch code and checkpoints of different sizes are available through Hugging Face transformers at http://tiny.cc/4iy0tz.
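To make two of the approaches in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' code. The function names (`pick_layers_to_copy`, `shrink`, `kd_loss`) are hypothetical, the evenly spaced layer selection is an assumed heuristic rather than the paper's confirmed scheme, and plain lists stand in for real Transformer layers and logits.

```python
import copy
import math


def pick_layers_to_copy(n_teacher: int, n_student: int) -> list[int]:
    """Choose teacher layer indices for the student.

    Assumption: evenly spaced indices that always keep the first and
    last teacher layers. The paper may select layers differently.
    """
    if not 1 <= n_student <= n_teacher:
        raise ValueError("need 1 <= n_student <= n_teacher")
    if n_student == 1:
        return [0]
    denom = n_student - 1
    # Integer arithmetic with round-half-up, to avoid float rounding surprises.
    return [(i * (n_teacher - 1) + denom // 2) // denom for i in range(n_student)]


def shrink(teacher_layers: list, n_student: int) -> list:
    """'Shrink' step of SFT: copy a subset of teacher layers into the student.

    The student would then be fine-tuned on the original training data
    with the ordinary summarization loss (not shown here).
    """
    indices = pick_layers_to_copy(len(teacher_layers), n_student)
    return [copy.deepcopy(teacher_layers[i]) for i in indices]


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: list[float], teacher_logits: list[float],
            temperature: float = 2.0) -> float:
    """Soft-target knowledge distillation loss for a single token.

    Computes KL(teacher || student) between temperature-softened
    distributions. The temperature value is an illustrative assumption.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    # Stand-in "layers": a 12-layer teacher shrunk to a 3-layer student.
    teacher = [{"layer": i} for i in range(12)]
    student = shrink(teacher, 3)
    print([layer["layer"] for layer in student])
    print(kd_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0]))
```

SFT uses only the `shrink` step followed by ordinary fine-tuning, while direct knowledge distillation adds a term like `kd_loss` so the student also matches the teacher's output distribution. Pseudo-labeling, the third approach, would instead replace the reference summaries with summaries generated by the teacher.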

Code Implementations: 1 repo