CL | Jul 24, 2025

Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

arXiv:2507.18212v1 | 5 citations | h-index: 7
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in deploying compressed LLMs by offering a plug-and-play method that improves the accuracy of layer-pruned models. The advance is incremental because it builds on existing layer pruning techniques.

The paper tackles performance degradation in layer-pruned large language models by proposing Prune&Comp, a training-free method that compensates for magnitude gaps in hidden states. When 5 layers of LLaMA-3-8B are pruned, it nearly halves perplexity and retains 93.19% of the original model's question-answering performance.

Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model's question-answering performance, outperforming the baseline by 4.01%.
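The estimate-then-rescale idea in the abstract can be sketched on a toy model. The abstract does not say which weights are rescaled or how the gap is measured, so everything below is an assumption made for illustration: a pre-norm residual stack stands in for the LLM, the gap is taken as the ratio of mean L2 norms of the hidden state after and before the removed block, and that scalar is folded into the embedding and the output projections of the preceding blocks. Block influence is assumed here to be one minus the mean cosine similarity between a block's input and output. All function names are hypothetical, and plain Python is used so the sketch runs without dependencies.

```python
import math

def matvec(w, v):
    # w is a list of rows; returns the matrix-vector product w @ v.
    return [sum(a * b for a, b in zip(row, v)) for row in w]

def norm(h):
    return math.sqrt(sum(x * x for x in h))

def rms_norm(h):
    # Scale-invariant: rms_norm(c * h) equals rms_norm(h) for any c > 0.
    rms = math.sqrt(sum(x * x for x in h) / len(h))
    return [x / rms for x in h]

def hidden_states(embed, blocks, x):
    # Toy pre-norm residual stack: h <- h + W_out @ tanh(W_in @ rms_norm(h)).
    h = matvec(embed, x)
    states = [h]
    for w_in, w_out in blocks:
        act = [math.tanh(a) for a in matvec(w_in, rms_norm(h))]
        h = [p + q for p, q in zip(h, matvec(w_out, act))]
        states.append(h)
    return states

def mean_norms(embed, blocks, calib):
    # Mean L2 norm of the hidden state at every depth over calibration inputs.
    runs = [hidden_states(embed, blocks, x) for x in calib]
    return [sum(norm(s[d]) for s in runs) / len(runs)
            for d in range(len(blocks) + 1)]

def scale(w, a):
    return [[a * x for x in row] for row in w]

def prune_and_compensate(embed, blocks, calib, k):
    # Assumed gap estimate: ratio of mean norms after vs. before block k.
    norms = mean_norms(embed, blocks, calib)
    alpha = norms[k + 1] / norms[k]
    # Fold alpha into every weight that writes to the residual stream before
    # block k, so the stream reaching the next block is scaled by alpha.
    # This happens once, offline; the pruned model has no extra operations.
    before = [(w_in, scale(w_out, alpha)) for w_in, w_out in blocks[:k]]
    return scale(embed, alpha), before + list(blocks[k + 1:]), alpha

def block_influence(embed, blocks, calib):
    # Assumed definition: 1 - mean cosine similarity of block input and output.
    runs = [hidden_states(embed, blocks, x) for x in calib]
    def cos(a, b):
        return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))
    return [1.0 - sum(cos(s[i], s[i + 1]) for s in runs) / len(runs)
            for i in range(len(blocks))]

def iterative_prune(embed, blocks, calib, n_remove):
    # Prune one block at a time, compensating before re-scoring the rest.
    for _ in range(n_remove):
        scores = block_influence(embed, blocks, calib)
        k = scores.index(min(scores))
        embed, blocks, _ = prune_and_compensate(embed, blocks, calib, k)
    return embed, blocks
```

In this toy setting the compensation is exact in the mean: because `rms_norm` ignores the scale of its input, multiplying the embedding and the earlier output projections by `alpha` multiplies the residual stream by `alpha`, so its mean norm at the cut matches what the removed block used to produce. A real transformer has more weights writing to the residual stream, and the paper's own procedure may differ from this sketch.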

