ASER: Activation Smoothing and Error Reconstruction for Large Language Model Quantization
This work addresses the problem of efficient model serving for LLMs by improving quantization techniques, though it appears incremental as it builds on existing compression objectives and methods.
The paper tackles the significant performance degradation caused by low-bit quantization of large language models. It introduces ASER, an algorithm that combines error reconstruction and activation smoothing to preserve accuracy, achieving competitive results in W4A8 per-channel setups with minor overhead.
Quantization is a pivotal technique for serving large language models (LLMs), yet effective low-bit quantization remains challenging. The limited numerical mapping causes the quantized model to produce non-trivial error, leading to intolerable performance degradation. This paper is anchored in the basic idea of model compression objectives and delves into the layer-wise error distribution of LLMs during post-training quantization. We then introduce ASER, an algorithm consisting of (1) Error Reconstruction: low-rank compensation for quantization error with LoRA-style matrices constructed by whitening SVD; and (2) Activation Smoothing: outlier extraction to obtain smooth activations and better error compensation. ASER can quantize typical LLMs to low-bit ones, preserving accuracy even in the W4A8 per-channel setup. Experimental results show that ASER is competitive among state-of-the-art quantization algorithms, shows potential for activation quantization, and incurs only minor overhead.
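The Error Reconstruction step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation; it assumes that "whitening" means factoring the calibration Gram matrix X X^T = S S^T by Cholesky decomposition, so that truncating the SVD of E S (with E = W - W_q) yields the rank-r correction minimising the layer output error ||(E - A B) X||_F. The function names, the symmetric round-to-nearest quantizer, the damping term, and the synthetic data are all assumptions made for illustration. Activation Smoothing is not modelled here.

```python
import numpy as np


def quantize_per_channel(W, bits=4):
    """Symmetric round-to-nearest quantization, one scale per output row (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale


def whitened_low_rank(E, X, rank, damp=1e-6):
    """Return LoRA-style factors (A, B) with A @ B approximating E under the metric ||. @ X||_F."""
    G = X @ X.T
    # Small damping (an assumption) keeps the Gram matrix positive definite.
    G = G + damp * np.mean(np.diag(G)) * np.eye(G.shape[0])
    S = np.linalg.cholesky(G)  # lower triangular, S @ S.T == G
    U, sigma, Vt = np.linalg.svd(E @ S, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]             # shape (out, rank)
    B = np.linalg.solve(S.T, Vt[:rank].T).T    # equals Vt[:rank] @ inv(S), shape (rank, in)
    return A, B


# Synthetic layer: 16 output channels, 32 input channels, 256 calibration tokens.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
X = rng.normal(size=(32, 256))
X[3] *= 20.0  # one outlier activation channel

W_q = quantize_per_channel(W, bits=4)
E = W - W_q  # quantization error to be compensated

A, B = whitened_low_rank(E, X, rank=4)

# Baseline: plain truncated SVD of E, which ignores the activations.
U, sigma, Vt = np.linalg.svd(E, full_matrices=False)
E_svd = (U[:, :4] * sigma[:4]) @ Vt[:4]

err_none = np.linalg.norm(E @ X)
err_svd = np.linalg.norm((E - E_svd) @ X)
err_whitened = np.linalg.norm((E - A @ B) @ X)
print(err_none, err_svd, err_whitened)
```

At inference the compensated layer would compute W_q x + A (B x), so the extra cost is two thin matrix products of rank r. Because the whitened factorization is optimal for the output-error objective among rank-r corrections, its error should not exceed that of the plain SVD baseline on the same calibration data.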