Attention Temperature Matters in Abstractive Summarization Distillation
This work addresses the need for efficient summarization models in NLP applications, but it is incremental as it builds on existing pseudo-labeling techniques.
The paper tackles the problem of distilling large Transformer models for abstractive summarization into smaller ones to reduce computational cost. It finds that manipulating attention temperatures during pseudo-labeling improves student model performance, with consistent gains over vanilla pseudo-labeling baselines on three datasets.
Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves over vanilla pseudo-labeling based methods. We also find that both the pseudo labels and the summaries produced by our students are shorter and more abstractive. Our code is available at https://github.com/Shengqiang-Zhang/plate.
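To make the idea of an attention temperature concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the temperature is the divisor applied to the query-key dot products before the softmax, and that it is raised by multiplying the usual `sqrt(d)` scale by a factor. The name `temperature_scale` and the toy vectors are hypothetical; the abstract does not specify the exact parameterization.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, temperature_scale=1.0):
    """Attention weights of one query over several keys.

    Assumption: the temperature is sqrt(temperature_scale * d), so
    temperature_scale=1.0 recovers standard scaled dot-product attention
    and larger values flatten the attention distribution.
    """
    d = len(query)
    tau = math.sqrt(temperature_scale * d)
    scores = [sum(q * k for q, k in zip(query, key)) / tau for key in keys]
    return softmax(scores)


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy example: one query attending over three keys (values are made up).
query = [1.0, 0.5, -0.5, 2.0]
keys = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 1.0, -1.0],
]

sharp = attention_weights(query, keys, temperature_scale=1.0)
smooth = attention_weights(query, keys, temperature_scale=2.0)

print("standard temperature:", sharp, "entropy:", entropy(sharp))
print("higher temperature:  ", smooth, "entropy:", entropy(smooth))
```

In this sketch, raising the temperature spreads attention mass more evenly across the keys. How such a change in the teacher affects the pseudo labels it generates is the empirical question the paper studies; the code above only illustrates the mechanism being adjusted.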