CL · Oct 16, 2021

A Short Study on Compressing Decoder-Based Language Models

arXiv:2110.08460v1 · 30 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient NLP models on resource-limited devices, but it is incremental as it builds on existing compression techniques.

The paper tackles the compression of decoder-based language models such as GPT-2 for edge devices. It applies knowledge distillation to improve the fine-tuning of DistilGPT-2, and it pre-trains a layer-truncated GPT-2 that needs less training time than DistilGPT-2 while performing better on downstream tasks.

Pre-trained Language Models (PLMs) have been successful for a wide range of natural language processing (NLP) tasks. State-of-the-art PLMs, however, are too large to be used on edge devices. As a result, model compression has attracted increasing attention in the NLP community. Most existing work focuses on compressing encoder-based models (TinyBERT, DistilBERT, DistilRoBERTa, etc.); to the best of our knowledge, however, the compression of decoder-based models (such as GPT-2) has not been investigated much. Our paper aims to fill this gap. Specifically, we explore two directions: 1) we employ current state-of-the-art knowledge distillation techniques to improve the fine-tuning of DistilGPT-2; 2) we pre-train a compressed GPT-2 model using layer truncation and compare it against the distillation-based method (DistilGPT-2). The training time of our compressed model is significantly less than that of DistilGPT-2, yet it achieves better performance when fine-tuned on downstream tasks. We also demonstrate the impact of data cleaning on model performance.
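
The two directions named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say which layers are kept or which distillation objective is used, so the keep-the-first-k rule, the generic temperature-softened KL loss, and every name below (`truncate_layers`, `distillation_loss`, the `block_i` labels) are assumptions made for illustration only.

```python
import math


def truncate_layers(layers, keep):
    """Return the first `keep` layers of a teacher stack.

    Assumption: keeping the bottom layers is only one possible
    truncation rule; the abstract does not specify the paper's choice.
    """
    if not 0 < keep <= len(layers):
        raise ValueError("keep must be between 1 and len(layers)")
    return list(layers[:keep])


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    This is the generic soft-target loss, scaled by temperature squared
    as is conventional; it stands in for whatever objective the paper uses.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# GPT-2 small has 12 transformer blocks; placeholder strings stand in for them.
teacher_layers = [f"block_{i}" for i in range(12)]
student_layers = truncate_layers(teacher_layers, keep=6)
```

In this sketch the truncated student would be initialized from the kept blocks and then pre-trained further, while the distillation loss would be added to the ordinary language-modeling loss during fine-tuning. Both choices are assumptions about a typical setup rather than details taken from the paper.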

