AI · Dec 24, 2024

SlimGPT: Layer-wise Structured Pruning for Large Language Models

arXiv:2412.18110v1 · 44 citations · h-index: 3 · NIPS
Originality: Incremental advance
AI Analysis

This work addresses the problem of deploying large language models efficiently for practitioners, though it is incremental as it builds on existing structured pruning techniques.

The authors tackled the challenge of efficiently pruning large language models (LLMs) to balance performance and deployment constraints, resulting in SlimGPT, a method that achieves state-of-the-art results on the LLaMA benchmark with pruning completed within one hour.

Large language models (LLMs) have garnered significant attention for their remarkable capabilities across various domains, but their vast parameter scales present challenges for practical deployment. Structured pruning is an effective way to balance model performance with efficiency, but restoring performance under computational resource constraints is a principal challenge in pruning LLMs. We therefore present SlimGPT, a low-cost and fast structured pruning method for LLMs based on the Optimal Brain Surgeon framework. We propose Batched Greedy Pruning for rapid and near-optimal pruning: it enhances the accuracy of head-wise pruning error estimation through grouped Cholesky decomposition and improves the pruning efficiency of the FFN via Dynamic Group Size, thereby achieving approximately locally optimal pruning results within one hour. In addition, we explore the limitations of layer-wise pruning from the perspective of error accumulation and propose Incremental Pruning Ratio, a non-uniform pruning strategy that reduces performance degradation. Experimental results on the LLaMA benchmark show that SlimGPT outperforms other methods and achieves state-of-the-art results.
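
To make the idea of a non-uniform, layer-wise pruning ratio concrete, here is a minimal sketch of one possible schedule. It is not the paper's formula. The abstract only says the strategy is non-uniform and motivated by error accumulation, so the linear shape, the assumption that deeper layers are pruned more, the `spread` parameter, and the function name `incremental_pruning_ratios` are all hypothetical choices made for illustration.

```python
def incremental_pruning_ratios(num_layers, target_ratio, spread=0.2):
    """Return one pruning ratio per layer, rising linearly with depth.

    Hypothetical helper: the linear shape and `spread` are assumptions,
    not taken from the SlimGPT paper. The ratios average to `target_ratio`.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")

    # Shallow layers get the lowest ratio, deep layers the highest.
    low = target_ratio - spread / 2
    high = target_ratio + spread / 2
    if low < 0.0 or high > 1.0:
        raise ValueError("spread pushes a ratio outside [0, 1]")

    if num_layers == 1:
        return [target_ratio]

    # Evenly spaced values from `low` to `high`; their mean is the target.
    step = (high - low) / (num_layers - 1)
    return [low + i * step for i in range(num_layers)]


# Example: 32 layers, 25% of parameters pruned on average.
ratios = incremental_pruning_ratios(num_layers=32, target_ratio=0.25, spread=0.2)
```

Centering the schedule on `target_ratio` keeps the average ratio across layers equal to the target while pruning early layers less. That matches the error-accumulation motivation: errors introduced in shallow layers propagate through every later layer. If layers differ in parameter count, the overall model-level ratio would need a weighted average instead.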

