CL, LG | Nov 22, 2023

Efficient Transformer Knowledge Distillation: A Performance Review

arXiv:2311.13657v1131 citations | h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses computational efficiency for NLP practitioners by providing a performance review of distillation on efficient transformers, though it is incremental as it evaluates existing methods rather than introducing new ones.

The paper examines compressing efficient attention transformers via knowledge distillation. It finds that distilled models preserve up to 98.8% of original performance on long-context NER (GONERD) and up to 94.6% on long-context QA, while reducing inference times by up to 57.8%.

As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pushed for advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence length. Despite these separate efforts, no investigation has been done into the intersection of these two fields. In this work, we provide an evaluation of model compression via knowledge distillation on efficient attention transformers. We provide cost-performance trade-offs for the compression of state-of-the-art efficient attention architectures and the gains made in performance in comparison to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition dataset, GONERD, to train and test the performance of NER models on long sequences. We find that distilled efficient attention transformers can preserve a significant amount of original model performance, preserving up to 98.6% across short-context tasks (GLUE, SQuAD, CoNLL-2003), up to 94.6% across long-context Question-and-Answering tasks (HotpotQA, TriviaQA), and up to 98.8% on long-context Named Entity Recognition (GONERD), while decreasing inference times by up to 57.8%. We find that, for most models on most tasks, performing knowledge distillation is an effective method to yield high-performing efficient attention models with low costs.
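The two headline figures, preserved performance and inference-time reduction, are both ratios of a distilled (student) model against its original (teacher) model. The abstract does not spell out the formulas, so the sketch below assumes the most common reading: preserved performance is the student score as a percentage of the teacher score, and the time reduction is the relative drop in latency. The function names and the sample numbers are hypothetical and are not results from the paper.

```python
def retained_performance(student_score: float, teacher_score: float) -> float:
    """Percentage of the teacher's task score that the student keeps (assumed definition)."""
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive")
    return 100.0 * student_score / teacher_score


def inference_time_reduction(student_seconds: float, teacher_seconds: float) -> float:
    """Relative drop in inference time, in percent (assumed definition)."""
    if teacher_seconds <= 0:
        raise ValueError("teacher_seconds must be positive")
    return 100.0 * (teacher_seconds - student_seconds) / teacher_seconds


# Hypothetical measurements for illustration only, not taken from the paper.
teacher_f1, student_f1 = 90.0, 81.0
teacher_latency, student_latency = 2.0, 1.0

print(f"retained performance: {retained_performance(student_f1, teacher_f1):.1f}%")
print(f"inference time reduction: {inference_time_reduction(student_latency, teacher_latency):.1f}%")
```

Under these definitions, a student that keeps most of the teacher's score while cutting latency sits at a favourable point on the cost-performance trade-off the abstract describes.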
