AI · Jul 26, 2024

Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining

arXiv:2407.19126v1 · 11 citations · h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses the challenge of efficient model compression for LLMs, offering a practical solution for deployment in resource-constrained environments, though it is incremental as it builds on existing pruning techniques.

The paper tackles the problem of pruning large language models (LLMs) without retraining. It proposes a single-shot method built on a depth-2 pruning structure and inference-aware criteria, and reports significant reductions in computational cost and hardware requirements while maintaining superior performance across datasets and models.

To remove redundant components of large language models (LLMs) without incurring significant computational costs, this work focuses on single-shot pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperform traditional training-aware metrics such as gradient and Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our approach significantly reduces computational costs and hardware requirements while maintaining superior performance across various datasets and models.
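To make the idea of pruning a depth-2 structure by output approximation more concrete, here is a minimal sketch in Python. It is not the paper's method: the two-layer ReLU block, the scoring rule (activation energy times outgoing weight norm), and all function names are assumptions chosen for illustration. The score of a hidden unit equals the squared output error incurred by removing that unit alone on a small calibration batch, so it needs only forward passes and no gradients or Hessians.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(W1, W2, X):
    # Hypothetical depth-2 block: y = W2 @ relu(W1 @ x), applied row-wise to X.
    return relu(X @ W1.T) @ W2.T

def hidden_unit_scores(W1, W2, X):
    """Squared output error caused by removing each hidden unit on its own.

    X: (n_samples, d_in), W1: (d_hidden, d_in), W2: (d_out, d_hidden).
    This is an illustrative criterion, not the one proposed in the paper.
    """
    H = relu(X @ W1.T)                  # hidden activations, (n_samples, d_hidden)
    act_energy = (H ** 2).sum(axis=0)   # per-unit activation energy
    col_norm = (W2 ** 2).sum(axis=0)    # squared norm of each outgoing column
    return act_energy * col_norm

def prune_hidden_units(W1, W2, scores, keep):
    # Keep the `keep` highest-scoring units; a unit is row j of W1 plus column j of W2.
    kept = np.sort(np.argsort(scores)[::-1][:keep])
    return W1[kept, :], W2[:, kept], kept

# Toy calibration data and weights (random, for demonstration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))

scores = hidden_unit_scores(W1, W2, X)
W1_p, W2_p, kept = prune_hidden_units(W1, W2, scores, keep=12)
rel_err = np.linalg.norm(forward(W1, W2, X) - forward(W1_p, W2_p, X)) / np.linalg.norm(forward(W1, W2, X))
print(f"kept {len(kept)} of {W1.shape[0]} hidden units, relative output error {rel_err:.3f}")
```

Removing a hidden unit deletes one row of the first matrix and the matching column of the second, so the block's input and output dimensions are unchanged and the rest of the network does not need to be modified. The per-unit score is exact only for removing a single unit; when several units are removed together, cross terms between them are ignored, which is what makes this a greedy approximation.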

