LGAI | Mar 24, 2025

Maximum Redundancy Pruning: A Principle-Driven Layerwise Sparsity Allocation for LLMs

arXiv:2503.18377v2 | 1 citation | h-index: 3 | MM
Originality: Incremental advance
AI Analysis

This work addresses the challenge of deploying large language models in real-world applications by improving pruning efficiency, though it is incremental as it builds on existing sparsity allocation techniques.

The paper addresses how to allocate sparsity across the layers of a large language model (LLM) for efficient deployment. It proposes Maximum Redundancy Pruning (MRP), which iteratively prunes the most redundant layers, guided by three principles: non-uniformity, pruning metric dependency, and uniform layerwise redundancy. The authors report that MRP outperforms previous methods on LLMs such as LLaMA2 and OPT.

Large language models (LLMs) have demonstrated impressive capabilities, but their enormous size poses significant challenges for deployment in real-world applications. To address this issue, researchers have sought to apply network pruning techniques to LLMs. A critical challenge in pruning is allocating the sparsity for each layer. Recent sparsity allocation methods are often based on heuristics or search, which can easily lead to suboptimal performance. In this paper, we conduct an extensive investigation into various LLMs and report three significant findings: (1) the layerwise pruning sensitivity (LPS) of LLMs is highly non-uniform, (2) the choice of pruning metric affects LPS, and (3) the performance of a sparse model is related to the uniformity of its layerwise redundancy level. Based on these observations, we propose that the layerwise sparsity of LLMs should adhere to three principles: non-uniformity, pruning metric dependency, and a uniform layerwise redundancy level in the pruned model. To this end, we propose Maximum Redundancy Pruning (MRP), an iterative pruning algorithm that, at each iteration, prunes the most redundant layers (i.e., those with the highest non-outlier ratio). The resulting layerwise sparsity aligns with the outlined principles. We conduct extensive experiments on publicly available LLMs, including LLaMA2 and OPT, across various benchmarks. Experimental results validate the effectiveness of MRP, demonstrating its superiority over previous methods.
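
The iterative allocation described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not define how outliers are detected, which pruning metric is used, or how much is pruned per iteration. The sketch assumes a weight is an outlier when its magnitude exceeds `outlier_factor` times the mean magnitude of the layer's remaining weights, uses plain weight magnitude as the pruning metric, and removes a fixed fraction of one layer per step. The names `non_outlier_ratio`, `prune_step`, `allocate_sparsity`, `outlier_factor`, and `step_fraction` are all hypothetical.

```python
import numpy as np


def non_outlier_ratio(weights, mask, outlier_factor=5.0):
    """Share of a layer's remaining weights that are not outliers.

    Assumption: an outlier has magnitude above outlier_factor times the
    mean magnitude of the remaining weights in the same layer.
    """
    scores = np.abs(weights[mask])
    if scores.size == 0:
        return 0.0  # a fully pruned layer is never selected again
    threshold = outlier_factor * scores.mean()
    return float(np.mean(scores <= threshold))


def prune_step(weights, mask, n_prune):
    """Drop the n_prune smallest-magnitude weights that are still kept."""
    # Already-pruned positions get an infinite score so they are not re-selected.
    scores = np.where(mask, np.abs(weights), np.inf).ravel()
    drop = np.argpartition(scores, n_prune - 1)[:n_prune]
    new_mask = mask.ravel().copy()
    new_mask[drop] = False
    return new_mask.reshape(mask.shape)


def allocate_sparsity(layers, target_sparsity, step_fraction=0.01,
                      outlier_factor=5.0):
    """Prune the most redundant layer repeatedly until the global target is met.

    Returns the resulting sparsity of each layer.
    """
    masks = [np.ones(w.shape, dtype=bool) for w in layers]
    total = sum(w.size for w in layers)
    to_prune = int(round(target_sparsity * total))
    pruned = 0
    while pruned < to_prune:
        # The most redundant layer is the one with the highest non-outlier ratio.
        ratios = [non_outlier_ratio(w, m, outlier_factor)
                  for w, m in zip(layers, masks)]
        i = int(np.argmax(ratios))
        n = min(max(1, int(step_fraction * layers[i].size)),
                to_prune - pruned,
                int(masks[i].sum()))
        if n == 0:
            break
        masks[i] = prune_step(layers[i], masks[i], n)
        pruned += n
    return [float(1.0 - m.mean()) for m in masks]
```

Calling `allocate_sparsity(layers, 0.5)` on a list of 2-D weight arrays returns one sparsity value per layer, and the values average (weighted by layer size) to the global target. How non-uniform the allocation turns out depends heavily on the assumed outlier threshold, so `outlier_factor` should be treated as a tunable placeholder rather than a value taken from the paper.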
