PIP: Perturbation-based Iterative Pruning for Large Language Models
PIP addresses the high computational cost of deploying LLMs in resource-constrained settings, offering an incremental improvement over existing structured pruning methods.
The paper tackles the challenge of deploying large language models (LLMs) in resource-constrained environments by proposing PIP, a perturbation-based iterative pruning method that reduces parameter counts by about 20% while retaining over 85% of the original model's accuracy across benchmarks.
The parameter counts of Large Language Models (LLMs) have grown rapidly, often reaching billions or even trillions, which makes practical deployment difficult, particularly in resource-constrained environments. To address this issue, we propose PIP (Perturbation-based Iterative Pruning), a novel double-view structured pruning method that combines information from two views of the model: the unperturbed view and the perturbed view. PIP computes gradient differences between the two views and iteratively prunes the components that struggle to distinguish between them. Our experiments show that PIP reduces the parameter count by approximately 20% while retaining over 85% of the original model's accuracy across varied benchmarks. In some cases, the performance of the pruned model is within 5% of the unpruned version, demonstrating PIP's ability to preserve key aspects of model effectiveness. Moreover, PIP consistently outperforms existing state-of-the-art (SOTA) structured pruning methods, establishing it as a leading technique for optimizing LLMs in constrained environments.
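The double-view idea can be illustrated with a minimal sketch on a toy model. The abstract does not specify how the perturbation is applied, how gradient differences are turned into scores, or what unit of structure is pruned, so everything below is an assumption made for illustration: the perturbation is Gaussian noise on the input, the score is the L2 norm of a hidden unit's gradient difference between the two views, the units with the smallest scores are removed, and all names (`double_view_scores`, `iterative_prune`, `prune_ratio`, `rounds`, `noise`) are hypothetical. It is not the paper's implementation.

```python
import math
import random


def forward(weights, out_weights, mask, x):
    """Output of a tiny one-hidden-layer linear model with a 0/1 mask per unit."""
    return sum(
        m * v * sum(wi * xi for wi, xi in zip(w, x))
        for w, v, m in zip(weights, out_weights, mask)
    )


def unit_gradients(weights, out_weights, mask, x, target):
    """Gradient of 0.5 * (y - target)**2 with respect to each unit's input weights."""
    err = forward(weights, out_weights, mask, x) - target
    return [[err * m * v * xi for xi in x] for v, m in zip(out_weights, mask)]


def double_view_scores(weights, out_weights, mask, x, target, noise, rng):
    """Assumed score: L2 norm of each unit's gradient difference across the two views."""
    x_pert = [xi + rng.gauss(0.0, noise) for xi in x]  # perturbed view (assumption)
    g_clean = unit_gradients(weights, out_weights, mask, x, target)
    g_pert = unit_gradients(weights, out_weights, mask, x_pert, target)
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(gc, gp)))
        for gc, gp in zip(g_clean, g_pert)
    ]


def iterative_prune(weights, out_weights, x, target,
                    prune_ratio=0.2, rounds=2, noise=0.1, seed=0):
    """Remove prune_ratio of the units over several rounds, rescoring each round."""
    rng = random.Random(seed)
    n = len(weights)
    mask = [1] * n
    total = int(n * prune_ratio)
    # Spread the pruning budget as evenly as possible over the rounds.
    per_round = [total // rounds + (1 if r < total % rounds else 0)
                 for r in range(rounds)]
    for k in per_round:
        scores = double_view_scores(weights, out_weights, mask, x, target, noise, rng)
        active = sorted((j for j in range(n) if mask[j] == 1),
                        key=lambda j: scores[j])
        for j in active[:k]:  # smallest gradient difference is pruned first
            mask[j] = 0
    return mask


# Toy data: 10 hidden units, 4 input features, one training example.
_rng = random.Random(42)
demo_weights = [[_rng.uniform(-1, 1) for _ in range(4)] for _ in range(10)]
demo_out_weights = [_rng.uniform(-1, 1) for _ in range(10)]
demo_x = [0.5, -1.0, 0.25, 2.0]
demo_mask = iterative_prune(demo_weights, demo_out_weights, demo_x, target=1.0)
print(sum(demo_mask), "of", len(demo_mask), "units kept")
```

Rescoring after every round is the point of the iterative loop: removing a unit changes the model's output and therefore the gradients of the units that remain. A real LLM setting would score structures such as attention heads or MLP channels over a calibration set rather than a single example, which this toy version does not attempt.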