CL · Nov 21, 2025

E$^3$-Pruner: Towards Efficient, Economical, and Effective Layer Pruning for Large Language Models

arXiv:2511.17205v1 · 2 citations
Originality: Incremental advance
AI Analysis

The work addresses practical deployment challenges in LLM compression, namely performance degradation, high training cost, and limited acceleration, though the advance is incremental.

The paper tackles layer pruning for large language models by proposing E$^3$-Pruner. When pruning 25% of the layers of Qwen3-32B, it reaches 96% accuracy on MATH-500, only 0.8 percentage points below the original model (96.8%), while providing a 1.33× inference speedup.
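As a rough sanity check on the reported numbers, the arithmetic below uses only figures stated in the abstract (25% of layers pruned, 96.8% baseline, 96% pruned). The speedup line is an idealized estimate that assumes latency scales linearly with the number of remaining layers and that embedding and output-head costs are negligible; that assumption is ours, not the paper's.

```latex
\begin{align*}
\text{ideal speedup} &= \frac{1}{1 - 0.25} = \frac{4}{3} \approx 1.33\times \\
\text{absolute accuracy drop} &= 96.8 - 96.0 = 0.8 \text{ percentage points} \\
\text{relative accuracy drop} &= \frac{0.8}{96.8} \approx 0.83\%
\end{align*}
```

Under that assumption, the reported 1.33× speedup is consistent with the ideal value for removing a quarter of the layers.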

With the increasing size of large language models, layer pruning has gained attention as a hardware-friendly approach to model compression. However, existing layer pruning methods struggle to simultaneously address key practical deployment challenges, including performance degradation, high training costs, and limited acceleration. To overcome these limitations, we propose E$^3$-Pruner, a task-Effective, training-Economical, and inference-Efficient layer pruning framework. E$^3$-Pruner introduces two key innovations: (1) a differentiable mask optimization method using a Gumbel-TopK sampler, enabling efficient and precise pruning mask search; and (2) an entropy-aware adaptive knowledge distillation strategy that enhances task performance. Extensive experiments over diverse model architectures and benchmarks demonstrate the superiority of our method over state-of-the-art approaches. Notably, when pruning 25% of the layers of Qwen3-32B, E$^3$-Pruner achieves 96% accuracy on MATH-500, a mere 0.8-point drop from the original model (96.8%), outperforming the existing SOTA (95%), with a 1.33× inference speedup while consuming only 0.5B tokens (0.5% of the post-training data volume).
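To make the Gumbel-TopK idea more concrete, here is a minimal sketch of the standard Gumbel-top-k trick applied to layer selection: each layer has a learnable logit, independent Gumbel(0, 1) noise is added, and the k highest perturbed scores define the keep-mask. It covers only the sampling step. The abstract does not say how the paper makes this differentiable (the gradient path), so that part is omitted, and the function name `gumbel_topk_mask` and the example logits are hypothetical.

```python
import math
import random


def gumbel_topk_mask(logits, k, rng):
    """Return a 0/1 keep-mask with exactly k ones, sampled via Gumbel-top-k.

    Illustrative only: not the paper's implementation.
    """
    perturbed = []
    for logit in logits:
        # Draw u from [0, 1) and clamp away from 0 so log(u) stays finite.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        perturbed.append(logit + gumbel)
    # Keep the k layers with the largest perturbed scores.
    order = sorted(range(len(logits)), key=lambda i: perturbed[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(logits))]


# Hypothetical example: 8 layers, keep 6 (prune 25%).
layer_logits = [2.0, 1.5, 0.1, 0.3, -0.5, 1.0, 0.8, 2.5]
mask = gumbel_topk_mask(layer_logits, k=6, rng=random.Random(0))
print(mask, sum(mask))
```

Layers with higher logits are kept more often, but the noise lets low-scoring layers be explored too. The mask always keeps exactly k layers, so the pruning ratio is fixed by construction.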

