LG CL | May 3, 2024

Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau

arXiv:2405.02267v210.47 citations | h-index: 4 | Has Code | Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses the efficiency challenges of deploying language models in real-world applications. It is an incremental improvement over traditional pruning methods.

The paper tackles the deployment of large pre-trained language models by applying neural architecture search to structural pruning, searching for sub-networks that balance efficiency (e.g., model size or latency) against performance. It proposes a multi-objective approach that identifies the Pareto optimal set of sub-networks, enabling automated compression.

Pre-trained language models (PLMs), for example BERT or RoBERTa, mark the state of the art for natural language understanding tasks when fine-tuned on labeled data. However, their large size poses challenges in deploying them for inference in real-world applications, due to significant GPU memory requirements and high inference latency. This paper explores neural architecture search (NAS) for structural pruning to find sub-parts of the fine-tuned network that optimally trade off efficiency, for example in terms of model size or latency, and generalization performance. We also show how we can utilize more recently developed two-stage weight-sharing NAS approaches in this setting to accelerate the search process. Unlike traditional pruning methods with fixed thresholds, we propose to adopt a multi-objective approach that identifies the Pareto optimal set of sub-networks, allowing for a more flexible and automated compression process.
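To make the notion of a Pareto optimal set concrete, the following minimal sketch filters a list of candidate sub-networks down to the non-dominated ones, assuming two objectives that are both minimized (model size and validation error). The function name `pareto_front` and the candidate values are hypothetical and purely illustrative; this is not the paper's search procedure, only the dominance check that defines the set it aims to find.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of non-dominated points; both objectives are minimized."""
    front = []
    for i, (size_i, err_i) in enumerate(points):
        # Point j dominates point i if it is no worse in both objectives
        # and strictly better in at least one.
        dominated = any(
            (size_j <= size_i and err_j <= err_i)
            and (size_j < size_i or err_j < err_i)
            for j, (size_j, err_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


# Hypothetical candidates: (parameters in millions, validation error).
candidates = [(110.0, 0.08), (66.0, 0.10), (66.0, 0.12), (40.0, 0.15), (90.0, 0.11)]
front = pareto_front(candidates)
print(front)
```

Each index kept in `front` is a sub-network for which no other candidate is both smaller and more accurate, so a practitioner can pick a trade-off after the search rather than fixing a pruning threshold in advance.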
