LGAIJan 25, 2025

Lightweight and Post-Training Structured Pruning for On-Device Large Lanaguage Models

arXiv:2501.15255v12 citationsh-index: 23
Originality Incremental advance
AI Analysis

This work addresses the challenge of deploying large language models on devices with limited resources, offering an incremental improvement over existing pruning methods.

The paper tackles the problem of reducing resource demands for large language models on resource-constrained devices by introducing COMP, a lightweight post-training structured pruning method that improves performance by 6.13% on the LLaMA-2-7B model with a 20% pruning ratio compared to LLM-Pruner while reducing memory overhead by 80%.

Considering the hardware-friendly characteristics and broad applicability, structured pruning has emerged as an efficient solution to reduce the resource demands of large language models (LLMs) on resource-constrained devices. Traditional structured pruning methods often need fine-tuning to recover performance loss, which incurs high memory overhead and substantial data requirements, rendering them unsuitable for on-device applications. Additionally, post-training structured pruning techniques typically necessitate specific activation functions or architectural modifications, thereby limiting their scope of applications. Herein, we introduce COMP, a lightweight post-training structured pruning method that employs a hybrid-granularity pruning strategy. COMP initially prunes selected model layers based on their importance at a coarse granularity, followed by fine-grained neuron pruning within the dense layers of each remaining model layer. To more accurately evaluate neuron importance, COMP introduces a new matrix condition-based metric. Subsequently, COMP utilizes mask tuning to recover accuracy without the need for fine-tuning, significantly reducing memory consumption. Experimental results demonstrate that COMP improves performance by 6.13\% on the LLaMA-2-7B model with a 20\% pruning ratio compared to LLM-Pruner, while simultaneously reducing memory overhead by 80\%.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes