CL | Dec 28, 2025

Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

Tao Yu, Yongqi An, Kuan Zhu, Guibo Zhu, Ming Tang, Jinqiao Wang

arXiv:2512.23014v1 | 4.91 citations | h-index: 26

Originality: Incremental advance

AI Analysis

This work addresses the computational and storage costs that LLMs impose on AI practitioners by improving how well pruned models generalize. The advance is incremental, since it builds on existing pruning methods such as FLAP and OBC.

The paper tackles the limited generalization of post-training structured pruning for Large Language Models (LLMs) when calibration sets are biased. It proposes Function-Aware Neuron Grouping (FANG) to improve downstream accuracy while preserving language modeling performance, and reports state-of-the-art results with 1.5% to 8.5% higher average accuracy than the FLAP and OBC baselines under 30% and 40% sparsity.

Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific functions. FANG groups neurons with similar functions based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weight. FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5% to 8.5% in average accuracy under 30% and 40% sparsity.
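To make the grouping idea concrete, here is a minimal NumPy sketch of group-wise pruning with token weighting. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the `boost` parameter, the argmax rule for assigning neurons to context types, and the squared-activation importance score are all hypothetical choices, since the abstract does not specify the actual criteria. It also assumes every context type appears at least once in the calibration tokens, and it omits the preservation of multi-context neurons and the adaptive per-block sparsity allocation.

```python
import numpy as np


def group_neurons_by_context(acts, token_types, num_types):
    """Assign each neuron to the context type where it is most active.

    acts:        (num_tokens, num_neurons) calibration activations
    token_types: (num_tokens,) integer context label per token
    Assumption: argmax of mean |activation| stands in for the paper's grouping rule.
    """
    mean_abs = np.stack(
        [np.abs(acts[token_types == t]).mean(axis=0) for t in range(num_types)]
    )  # shape: (num_types, num_neurons)
    return mean_abs.argmax(axis=0)


def grouped_prune_mask(acts, token_types, num_types, sparsity, boost=4.0):
    """Return a boolean keep-mask, pruning each neuron group independently."""
    groups = group_neurons_by_context(acts, token_types, num_types)
    keep = np.ones(acts.shape[1], dtype=bool)
    for t in range(num_types):
        idx = np.flatnonzero(groups == t)
        if idx.size == 0:
            continue
        # Tokens of the group's own context type get a larger weight
        # (hypothetical weighting; `boost` is an illustrative parameter).
        w = np.where(token_types == t, boost, 1.0)
        w = w / w.sum()
        # Weighted mean squared activation as a stand-in importance score.
        importance = (w[:, None] * acts[:, idx] ** 2).sum(axis=0)
        n_prune = int(np.floor(sparsity * idx.size))
        # Drop the least important neurons within this group only.
        keep[idx[np.argsort(importance)[:n_prune]]] = False
    return keep


# Usage on synthetic data: 200 tokens, 32 neurons, 4 context types.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 32))
token_types = np.arange(200) % 4
keep = grouped_prune_mask(acts, token_types, num_types=4, sparsity=0.3)
```

Because the pruning budget is applied per group, no single context type can lose all of its neurons just because the calibration set under-represents it, which is the intuition the abstract gives for reducing calibration bias.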
