CL · AI · LG · Oct 20, 2025

From Local to Global: Revisiting Structured Pruning Paradigms for Large Language Models

arXiv:2510.18030v1 · 2 citations · h-index: 16
Originality: Highly original
AI Analysis

This work addresses the need for efficient deployment of large language models by improving structured pruning methods, offering incremental advancements over existing local paradigms.

The paper tackles the problem of structured pruning for large language models by introducing GISP, a global iterative pruning method that uses loss-based importance weights and an iterative schedule to improve stability and accuracy at high sparsity levels. Results show consistent reductions in WikiText-2 perplexity and improved downstream accuracy, with strong gains at 40-50% sparsity and task-aligned calibration boosting exact-match accuracy on GSM8K.
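The summary above names three ingredients: loss-based importance, a global ranking, and an iterative schedule. The following is a minimal sketch of how they could fit together, not the authors' implementation. It assumes a Taylor-style score `|sum(w * dL/dw)|` per structure, normalization by each block's score total, and a linear sparsity schedule; none of these specifics are confirmed by the text. All names (`first_order_score`, `blockwise_normalize`, `global_iterative_prune`) and the toy weights and gradients are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set, Tuple

# A unit is one prunable structure (an attention head or an MLP channel),
# identified here by (block index, unit index). Hypothetical naming.
Unit = Tuple[int, int]


def first_order_score(weights: List[float], grads: List[float]) -> float:
    """Taylor-style saliency of one unit: |sum_i w_i * dL/dw_i| (assumed form)."""
    return abs(sum(w * g for w, g in zip(weights, grads)))


def blockwise_normalize(raw: Dict[Unit, float]) -> Dict[Unit, float]:
    """Divide each score by its block's total so blocks can be ranked together."""
    totals: Dict[int, float] = {}
    for (block, _), score in raw.items():
        totals[block] = totals.get(block, 0.0) + score
    return {u: (s / totals[u[0]] if totals[u[0]] > 0 else 0.0)
            for u, s in raw.items()}


def global_iterative_prune(
    units: Iterable[Unit],
    score_fn: Callable[[Set[Unit]], Dict[Unit, float]],
    target_sparsity: float,
    steps: int,
) -> List[FrozenSet[Unit]]:
    """Prune to target_sparsity over `steps` rounds, re-scoring survivors each round."""
    kept = set(units)
    total = len(kept)
    trajectory = []
    for step in range(1, steps + 1):
        # Linear schedule (assumption): cumulative removals grow evenly per round.
        n_keep = total - round(total * target_sparsity * step / steps)
        scores = blockwise_normalize(score_fn(kept))
        # One global ranking across all blocks; ties broken by unit id.
        ranked = sorted(kept, key=lambda u: (-scores[u], u))
        kept = set(ranked[:n_keep])
        trajectory.append(frozenset(kept))
    return trajectory


# Toy data: two blocks of four units, with block 1 on a much larger scale.
weights = {
    (0, 0): [1.0, 2.0], (0, 1): [0.1, 0.1], (0, 2): [1.0, -1.0], (0, 3): [2.0, 0.0],
    (1, 0): [10.0, 10.0], (1, 1): [10.0, 0.0], (1, 2): [5.0, 5.0], (1, 3): [1.0, 1.0],
}
grads = {
    (0, 0): [0.5, 0.5], (0, 1): [0.1, 0.1], (0, 2): [1.0, 0.5], (0, 3): [1.0, 3.0],
    (1, 0): [1.0, 1.0], (1, 1): [0.1, 0.0], (1, 2): [1.0, 1.0], (1, 3): [1.0, 1.0],
}


def toy_score_fn(kept: Set[Unit]) -> Dict[Unit, float]:
    # A real run would recompute gradients on the pruned model with calibration
    # data each round; the toy gradients here are fixed for simplicity.
    return {u: first_order_score(weights[u], grads[u]) for u in kept}


trajectory = global_iterative_prune(weights.keys(), toy_score_fn,
                                    target_sparsity=0.5, steps=2)
for kept in trajectory:
    print(len(kept), sorted(kept))
```

In this toy, ranking by raw scores would favor block 1 simply because its weights are larger; the per-block normalization is what lets units from block 0 compete in the single global ranking.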

Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP (Global Iterative Structured Pruning), a post-training method that removes attention heads and MLP channels using first-order, loss-based importance weights aggregated at the structure level with block-wise normalization. An iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity and mitigates perplexity collapse without requiring intermediate fine-tuning; the pruning trajectory also forms nested subnetworks that support a "prune-once, deploy-many" workflow. Furthermore, because importance is defined by a model-level loss, GISP naturally supports task-specific objectives; we instantiate perplexity for language modeling and a margin-based objective for decision-style tasks. Extensive experiments show that across Llama2-7B/13B, Llama3-8B, and Mistral-0.3-7B, GISP consistently lowers WikiText-2 perplexity and improves downstream accuracy, with especially strong gains at 40-50% sparsity; on DeepSeek-R1-Distill-Llama-3-8B with GSM8K, task-aligned calibration substantially boosts exact-match accuracy.
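To illustrate the "prune-once, deploy-many" idea, here is a minimal sketch of how a nested pruning trajectory might be stored once and then reused to derive masks at several sparsity levels. It assumes the trajectory is available as a list of kept-unit sets at increasing sparsity; the helper names (`removal_order`, `kept_at`) and the toy trajectory are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Set, Tuple

# A unit is one prunable structure (head or MLP channel): (block, index).
Unit = Tuple[int, int]


def removal_order(all_units: Sequence[Unit],
                  trajectory: Sequence[Set[Unit]]) -> List[Unit]:
    """Flatten a nested trajectory into one ordered list of removed units."""
    order: List[Unit] = []
    previous = set(all_units)
    for kept in trajectory:
        if not set(kept) <= previous:
            raise ValueError("trajectory is not nested")
        # Order within a single round is arbitrary here (sorted by id), so only
        # the checkpoint sparsities reproduce the pruning run exactly.
        order.extend(sorted(previous - set(kept)))
        previous = set(kept)
    return order


def kept_at(all_units: Sequence[Unit], order: Sequence[Unit],
            sparsity: float) -> List[Unit]:
    """Units kept at a given sparsity: drop the first round(sparsity * N) removals."""
    n_remove = round(sparsity * len(all_units))
    if n_remove > len(order):
        raise ValueError("sparsity exceeds what the stored trajectory covers")
    removed = set(order[:n_remove])
    return [u for u in all_units if u not in removed]


# Toy example: 8 units, with checkpoints at 25% and 50% sparsity.
all_units = [(b, i) for b in range(2) for i in range(4)]
toy_trajectory = [
    {(0, 0), (0, 2), (0, 3), (1, 0), (1, 2), (1, 3)},  # 25% removed
    {(0, 0), (0, 3), (1, 0), (1, 2)},                  # 50% removed
]

order = removal_order(all_units, toy_trajectory)  # computed once
model_25 = kept_at(all_units, order, 0.25)        # reused for each deployment
model_50 = kept_at(all_units, order, 0.50)
print(model_25)
print(model_50)
```

Because every smaller subnetwork is a subset of the larger ones, a single stored removal order is enough to serve several deployment budgets without rerunning the pruning procedure.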

