CL · Aug 20, 2022

Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks

arXiv:2208.09684v1 · 584 citations · h-index: 21
Originality: Incremental advance
AI Analysis

This work gives NLP practitioners practical guidance on choosing and combining compression methods to get the best accuracy vs. model size tradeoff. It is an incremental advance, since it builds on existing compression techniques rather than introducing a new one.

The study systematically compared combinations of quantization, knowledge distillation, and pruning across BERT architectures and GLUE tasks. It found that quantization and distillation offer greater benefits than pruning, and that combining methods often gives complementary or super-multiplicative size reductions and rarely shows diminishing returns.

Quantization, knowledge distillation, and magnitude pruning are among the most popular methods for neural network compression in NLP. Independently, these methods reduce model size and can accelerate inference, but their relative benefit and combinatorial interactions have not been rigorously studied. For each of the eight possible subsets of these techniques, we compare accuracy vs. model size tradeoffs across six BERT architecture sizes and eight GLUE tasks. We find that quantization and distillation consistently provide greater benefit than pruning. Surprisingly, except for the pair of pruning and quantization, using multiple methods together rarely yields diminishing returns. Instead, we observe complementary and super-multiplicative reductions to model size. Our work quantitatively demonstrates that combining compression methods can synergistically reduce model size, and that practitioners should prioritize (1) quantization, (2) knowledge distillation, and (3) pruning to maximize accuracy vs. model size tradeoffs.
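To make the "eight possible subsets" and the multiplicative baseline concrete, here is a minimal sketch. It assumes each method has a single size ratio (compressed size divided by original size) and that "multiplicative" means the combined ratio equals the product of the individual ratios. The ratios, function names, and the three labels below are illustrative assumptions, not values or definitions taken from the paper.

```python
from itertools import combinations

METHODS = ("quantization", "distillation", "pruning")

# Hypothetical per-method size ratios (compressed size / original size).
# Placeholder numbers for illustration only, not results from the paper.
SIZE_RATIO = {"quantization": 0.25, "distillation": 0.5, "pruning": 0.6}


def all_subsets(methods):
    """Return every subset of methods, from the empty set to the full set."""
    return [
        subset
        for k in range(len(methods) + 1)
        for subset in combinations(methods, k)
    ]


def multiplicative_size(subset, ratios):
    """Size ratio expected if the methods in subset compose independently."""
    size = 1.0
    for method in subset:
        size *= ratios[method]
    return size


def classify(observed, expected, tol=1e-9):
    """Compare an observed combined size ratio with the multiplicative baseline.

    The labels are this sketch's own shorthand: a smaller observed size than
    the product of individual ratios is called super-multiplicative.
    """
    if observed < expected - tol:
        return "super-multiplicative"
    if observed > expected + tol:
        return "sub-multiplicative"
    return "multiplicative"


if __name__ == "__main__":
    for subset in all_subsets(METHODS):
        name = " + ".join(subset) if subset else "baseline (no compression)"
        print(f"{name}: expected size ratio {multiplicative_size(subset, SIZE_RATIO):.3f}")
```

In practice the observed ratio would come from measuring the smallest model in each subset that reaches a fixed accuracy target, and `classify` would then be applied to that measurement against the baseline product.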

