LG CL May 23, 2024

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi

arXiv:2405.14917v229.965 citations h-index: 24 Has Code ICML

Originality: Incremental advance

AI Analysis

This work targets efficient deployment of LLMs in settings where memory and compute are constrained. It is an incremental improvement over existing quantization methods.

The paper tackles the compression of large language models (LLMs) through post-training quantization. It proposes SliM-LLM, a salience-driven mixed-precision framework that improves accuracy while keeping the efficiency of uniform quantization. For example, a 2-bit quantized LLaMA-7B model uses nearly 6x less memory than the floating-point baseline and reaches 48% lower perplexity than state-of-the-art gradient-free PTQ methods.
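The memory figure can be put in context with a little arithmetic. The sketch below estimates storage per weight for group-wise quantization when each group also stores a scale and a zero point. The 16-bit baseline, the group size of 128, the 16-bit scale, and the function names are illustrative assumptions, not values taken from the paper. It shows only one reason a measured ratio can fall below the ideal 8x for 2-bit weights; it does not try to reproduce the reported figure.

```python
def effective_bits_per_weight(weight_bits, group_size,
                              scale_bits=16, zero_point_bits=None):
    """Average storage per weight, including per-group quantizer metadata.

    Assumption: every group of `group_size` weights stores one scale and
    one zero point. By default the zero point uses `weight_bits` bits.
    """
    if zero_point_bits is None:
        zero_point_bits = weight_bits
    overhead = (scale_bits + zero_point_bits) / group_size
    return weight_bits + overhead


def compression_ratio(baseline_bits, weight_bits, group_size,
                      scale_bits=16, zero_point_bits=None):
    """Ratio of baseline storage to quantized storage, for weights only."""
    quantized = effective_bits_per_weight(
        weight_bits, group_size, scale_bits, zero_point_bits)
    return baseline_bits / quantized


if __name__ == "__main__":
    # Hypothetical setting: 16-bit baseline, 2-bit weights, groups of 128.
    print(effective_bits_per_weight(2, 128))
    print(compression_ratio(16, 2, 128))
```

Layers that stay in higher precision, such as embeddings, would lower the whole-model ratio further. That effect is not modeled here.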

Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group level. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration optimizes quantizer parameters by incorporating element-level salience. With its structured partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM.
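The idea of assigning bit-widths to groups by salience can be illustrated with a minimal sketch. It is not the authors' implementation. The salience proxy (squared weight times mean squared input activation), the rule of raising the most salient groups by one bit while lowering an equal number of the least salient by one bit, and the names `group_salience`, `allocate_bits`, and `n_swap` are all assumptions chosen for illustration. The balanced swap keeps the layer's average bit-width at the target, which is what lets a mixed-precision layer match the footprint of a uniform one.

```python
import numpy as np


def group_salience(weight, act_sq_mean, group_size):
    """Sum an element-wise salience proxy over groups of input columns.

    weight:      array of shape (out_features, in_features)
    act_sq_mean: array of shape (in_features,), mean squared activation
    Proxy (an assumption, not the paper's formula): w**2 * E[x**2].
    """
    out_features, in_features = weight.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    elem = weight ** 2 * act_sq_mean[None, :]
    n_groups = in_features // group_size
    grouped = elem.reshape(out_features, n_groups, group_size)
    return grouped.sum(axis=(0, 2))


def allocate_bits(salience, target_bits, n_swap):
    """Give +1 bit to the n_swap most salient groups and -1 bit to the
    n_swap least salient, so the mean bit-width stays at target_bits."""
    if n_swap < 0 or 2 * n_swap > len(salience):
        raise ValueError("n_swap must satisfy 0 <= 2 * n_swap <= n_groups")
    order = np.argsort(salience)  # ascending: least salient first
    bits = np.full(len(salience), target_bits, dtype=int)
    if n_swap > 0:
        bits[order[:n_swap]] = target_bits - 1
        bits[order[-n_swap:]] = target_bits + 1
    return bits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weight = rng.normal(size=(8, 512))
    act_sq_mean = rng.uniform(0.1, 2.0, size=512)
    salience = group_salience(weight, act_sq_mean, group_size=128)
    print(allocate_bits(salience, target_bits=2, n_swap=1))
```

Because every group is a contiguous block of columns with a single bit-width, the layout stays regular, which is the property the abstract describes as hardware-friendly. How many groups to swap is a free parameter in this sketch.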
