CL, LG, ML
Oct 10, 2019

Structured Pruning of Large Language Models

arXiv:1910.04732v2
1076 citations
AI Analysis

This work addresses efficiency issues for users of large language models, but it is incremental as it builds on existing pruning techniques.

The paper tackles the high computational cost and latency of large language models. It proposes a structured pruning method that parameterizes each weight matrix by its low-rank factorization and removes rank-1 components during training. The method outperforms unstructured and block-structured pruning baselines at various compression levels and yields significant speedups in both training and inference.

Large language models have recently achieved state of the art performance across a wide variety of natural language tasks. Meanwhile, the size of these models and their latency have significantly increased, which makes their usage costly, and raises an interesting question: do language models need to be large? We study this question through the lens of model compression. We present a generic, structured pruning approach by parameterizing each weight matrix using its low-rank factorization, and adaptively removing rank-1 components during training. On language modeling tasks, our structured approach outperforms other unstructured and block-structured pruning baselines at various compression levels, while achieving significant speedups during both training and inference. We also demonstrate that our method can be applied to pruning adaptive word embeddings in large language models, and to pruning the BERT model on several downstream fine-tuning classification benchmarks.
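The core idea in the abstract, writing a weight matrix as a product of two factors and dropping whole rank-1 components, can be illustrated with a small sketch. This is not the authors' implementation: the abstract says components are removed adaptively during training, whereas the sketch below prunes once, after the fact, using the norm of each component as a stand-in criterion. The function names `factorize` and `prune_rank1`, the matrix sizes, and the use of a truncated SVD for initialization are all assumptions made for illustration.

```python
import numpy as np

def factorize(W, rank):
    """Return P (d_out x rank) and Q (rank x d_in) with P @ Q ~= W.

    Assumption: factors are initialized from a truncated SVD.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s[:rank])
    P = U[:, :rank] * root           # scale columns of U
    Q = root[:, None] * Vt[:rank]    # scale rows of V^T
    return P, Q

def prune_rank1(P, Q, keep):
    """Keep the `keep` rank-1 components p_k q_k^T with the largest norm.

    The Frobenius norm of p_k q_k^T equals ||p_k|| * ||q_k||.
    This magnitude criterion is a placeholder, not the paper's rule.
    """
    scores = np.linalg.norm(P, axis=0) * np.linalg.norm(Q, axis=1)
    idx = np.sort(np.argsort(scores)[-keep:])
    return P[:, idx], Q[idx, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))            # hypothetical dense layer

P, Q = factorize(W, rank=48)                 # full rank: P @ Q reproduces W
P_small, Q_small = prune_rank1(P, Q, keep=16)

x = rng.standard_normal(48)
y = P_small @ (Q_small @ x)                  # two thin matmuls replace one dense matmul

dense_params = W.size                        # 64 * 48
pruned_params = P_small.size + Q_small.size  # 64 * 16 + 16 * 48
```

Because each removed component deletes one full column of `P` and one full row of `Q`, the pruned layer remains a pair of smaller dense matrices. That is what makes the pruning structured: the saving shows up as smaller matrix multiplications rather than as scattered zeros that need sparse kernels.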

Code Implementations: 2 repos