LG · Jun 11, 2025

AWP: Activation-Aware Weight Pruning and Quantization with Projected Gradient Descent

Jing Liu, Toshiaki Koike-Akino, Ye Wang, Hassan Mansour, Matthew Brand

arXiv:2506.10205v17.12 citations · h-index: 58

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying large models on resource-constrained devices, though it appears incremental because it builds on existing compression techniques.

The paper tackles the problem of compressing Large Language Models (LLMs) for edge devices by proposing a unified method for activation-aware weight pruning and quantization, which outperforms state-of-the-art methods in experiments.

To address the enormous size of Large Language Models (LLMs), model compression methods, such as quantization and pruning, are often deployed, especially on edge devices. In this work, we focus on layer-wise post-training quantization and pruning. Drawing connections between activation-aware weight pruning and sparse approximation problems, and motivated by the success of Iterative Hard Thresholding (IHT), we propose a unified method for Activation-aware Weight pruning and quantization via Projected gradient descent (AWP). Our experiments demonstrate that AWP outperforms state-of-the-art LLM pruning and quantization methods. Theoretical convergence guarantees of the proposed method for pruning are also provided.
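To make the connection between activation-aware pruning and Iterative Hard Thresholding concrete, here is a minimal NumPy sketch of projected gradient descent on a single layer. It is not the authors' implementation. It assumes the layer-wise objective is 0.5 * ||(W - Theta) X^T||_F^2, where W is the dense weight matrix and X holds calibration activations, and it assumes a fixed per-row sparsity budget, a step size of 1 / lambda_max(X^T X), and initialization from magnitude-pruned weights. The function names (`project_topk_rows`, `layer_loss`, `iht_prune`) are hypothetical.

```python
import numpy as np


def project_topk_rows(theta, k):
    """Hard thresholding: keep the k largest-magnitude entries in each row."""
    out = np.zeros_like(theta)
    # Column indices of the k largest |entries| per row (order within the k is arbitrary).
    idx = np.argpartition(np.abs(theta), -k, axis=1)[:, -k:]
    rows = np.arange(theta.shape[0])[:, None]
    out[rows, idx] = theta[rows, idx]
    return out


def layer_loss(W, theta, X):
    """Assumed activation-aware objective: 0.5 * ||(W - theta) X^T||_F^2."""
    return 0.5 * np.linalg.norm((W - theta) @ X.T) ** 2


def iht_prune(W, X, k, n_iter=50):
    """Projected gradient descent onto the set of row-wise k-sparse matrices."""
    C = X.T @ X                              # activation Gram matrix (in x in)
    step = 1.0 / np.linalg.eigvalsh(C)[-1]   # 1 / largest eigenvalue (assumed step size)
    theta = project_topk_rows(W, k)          # assumed init: magnitude pruning
    for _ in range(n_iter):
        grad = (theta - W) @ C               # gradient of the objective w.r.t. theta
        theta = project_topk_rows(theta - step * grad, k)
    return theta


# Toy example with synthetic data (not from the paper).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((64, 16)) * rng.uniform(0.1, 3.0, size=16)
k = 8  # keep 8 of 16 weights per row (50% sparsity)

theta0 = project_topk_rows(W, k)
theta = iht_prune(W, X, k)
print("magnitude-pruning loss:", layer_loss(W, theta0, X))
print("IHT loss:              ", layer_loss(W, theta, X))
```

With this step size each iteration cannot increase the objective, so the result is at least as good as the magnitude-pruned starting point on the calibration data. A quantization variant would, under the same assumptions, swap the top-k projection for a projection onto the quantization grid.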
