CL Jul 16, 2024

MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models

Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi

arXiv:2407.11681v14.89 citations, h-index: 7

Originality: Incremental advance

AI Analysis

The paper addresses LLM compression for researchers and practitioners by offering a more effective pruning method with manageable memory costs. The advance is incremental because it builds on existing pruning techniques.

The paper tackles the memory-intensive challenge of using gradients for pruning large language models (LLMs) by proposing MINI-LLM, a memory-efficient structured pruning method that estimates gradients via forward passes and integrates magnitude, activation, and gradient criteria. It shows superior performance over gradient-free methods on models like LLaMA, BLOOM, and OPT across tasks, with a GPU memory footprint similar to gradient-free approaches.

As Large Language Models (LLMs) grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compression, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove non-critical channels and multi-attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs: LLaMA, BLOOM, and OPT across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to gradient-free methods.
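The abstract does not give the scoring formula or the estimator, so the following is only a toy sketch of the general idea: estimate gradients from forward passes alone, then combine magnitude, activation, and gradient into a per-channel score. Everything here is an assumption made for illustration: the SPSA-style random-perturbation estimator, the product-form score, the tiny linear layer standing in for an LLM, and the hypothetical names `loss`, `estimate_gradient`, and `hybrid_channel_scores`. It should not be read as the paper's actual procedure.

```python
import random


def loss(weights, inputs, targets):
    """Mean squared error of a toy linear layer y = W x (stand-in for an LLM loss)."""
    total = 0.0
    for x, t in zip(inputs, targets):
        for row, t_i in zip(weights, t):
            y_i = sum(w * x_j for w, x_j in zip(row, x))
            total += (y_i - t_i) ** 2
    return total / len(inputs)


def estimate_gradient(weights, inputs, targets, eps=1e-3, num_samples=200, seed=0):
    """Assumed SPSA-style estimator: two forward passes per random direction, no backprop."""
    rng = random.Random(seed)
    rows, cols = len(weights), len(weights[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for _ in range(num_samples):
        # Random +/-1 perturbation direction with the same shape as the weights.
        z = [[rng.choice((-1.0, 1.0)) for _ in range(cols)] for _ in range(rows)]
        plus = [[w + eps * d for w, d in zip(wr, zr)] for wr, zr in zip(weights, z)]
        minus = [[w - eps * d for w, d in zip(wr, zr)] for wr, zr in zip(weights, z)]
        # Central difference approximates the directional derivative along z.
        scale = (loss(plus, inputs, targets) - loss(minus, inputs, targets)) / (2 * eps)
        for i in range(rows):
            for j in range(cols):
                grad[i][j] += scale * z[i][j] / num_samples
    return grad


def hybrid_channel_scores(weights, grad, inputs):
    """Hypothetical score per input channel j: sum_i |W_ij * g_ij| times the activation L2 norm."""
    scores = []
    for j in range(len(weights[0])):
        act_norm = sum(x[j] ** 2 for x in inputs) ** 0.5
        sensitivity = sum(abs(row[j] * g_row[j]) for row, g_row in zip(weights, grad))
        scores.append(sensitivity * act_norm)
    return scores


# Toy data: input channel 2 has all-zero weights, so its score is exactly zero.
weights = [[0.5, -1.0, 0.0], [1.5, 0.25, 0.0]]
inputs = [[1.0, 2.0, 0.5], [0.5, -1.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]

grad = estimate_gradient(weights, inputs, targets)
scores = hybrid_channel_scores(weights, grad, inputs)
prune_index = min(range(len(scores)), key=scores.__getitem__)
print("channel scores:", scores)
print("lowest-scoring channel:", prune_index)
```

In this sketch the memory cost is that of forward passes plus one gradient-shaped buffer, which is the property the abstract emphasizes. The estimate is noisy, and its variance grows with the number of parameters, so a real method would need far more care than this toy version shows.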
