CL AI May 27

PrunePath: Towards Highly Structured Sparse Language Models

arXiv:2605.28283

AI Analysis

For practitioners deploying large language models, PrunePath provides a method to convert sparsity into hardware-friendly inference efficiency gains, addressing a key bottleneck in model deployment.

PrunePath introduces a budget-adaptive structured sparsification framework for FFN layers in language models, achieving favorable sparsity-performance trade-offs across NLU, NLG, and instruction-tuning tasks, with practical memory savings and decoding-speed improvements via Triton kernels.

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
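The cumulative-mass routing described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so this is only one plausible reading: the function name `select_experts`, the `budget` parameter, and the example scores are hypothetical, and the selection rule (softmax over per-expert scores, then keep the most probable experts until their summed probability reaches the budget) is an assumption rather than the authors' implementation.

```python
import math


def select_experts(scores, budget):
    """Return indices of experts activated for one token.

    scores: raw router scores, one per expert (hypothetical input).
    budget: cumulative probability mass in (0, 1] that the chosen
            experts must cover (assumed meaning of the threshold).
    """
    # Numerically stable softmax over the expert scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit experts from most to least probable and stop as soon as
    # the accumulated mass reaches the budget.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= budget:
            break
    return chosen


# A token with one dominant expert versus a token with uniform scores.
peaked = [6.0, 1.0, 0.5, 0.0]
flat = [1.0, 1.0, 1.0, 1.0]

print(select_experts(peaked, 0.9))  # → [0]
print(select_experts(flat, 0.9))    # → [0, 1, 2, 3]
print(select_experts(flat, 0.5))    # → [0, 1]
```

Under this reading, the number of active experts adapts per token (one for the peaked token, four for the flat one), and lowering `budget` at inference time raises sparsity without retraining, which matches the "sparsity knob from a single checkpoint" idea in the abstract.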
