CL, AI, LG | Feb 15, 2022

A Survey on Model Compression and Acceleration for Pretrained Language Models

arXiv:2202.07105v2 | 99 citations
AI Analysis

The paper addresses the efficiency challenges of deploying large language models in edge and mobile computing; as a survey, its contribution is incremental.

This survey reviews model compression and acceleration techniques for pretrained language models to address high energy costs and inference delays, focusing on benchmarks, metrics, and methodologies for the inference stage.

Despite achieving state-of-the-art performance on many NLP tasks, the high energy cost and long inference delay prevent Transformer-based pretrained language models (PLMs) from seeing broader adoption including for edge and mobile computing. Efficient NLP research aims to comprehensively consider computation, time and carbon emission for the entire life-cycle of NLP, including data preparation, model training and inference. In this survey, we focus on the inference stage and review the current state of model compression and acceleration for pretrained language models, including benchmarks, metrics and methodology.

