LG | CL | Mar 25, 2025

QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition

arXiv:2503.19353v1 | 4 citations | h-index: 20 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues for users deploying LLMs on resource-constrained devices by improving quantization accuracy, though it is an incremental advance over existing methods.

The paper tackles accuracy degradation in medium-sized LLMs under 4-bit quantization, which is caused by activation outliers. It proposes QUAD, which uses SVD to suppress those outliers, and reports 94-96% accuracy under W4A4 quantization and up to 98% with fine-tuning for models such as Llama-3 and Qwen-2.5.

Large Language Models (LLMs) excel in diverse applications but are inefficient because of their massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) because of activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework that leverages Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline from calibration data to construct an orthogonal transformation matrix P, which shifts outliers into additional dimensions kept in full precision while the remaining components are quantized to 4 bits. Additionally, QUAD enables parameter-efficient fine-tuning through adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% to 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at https://github.com/hyx1999/Quad.
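The decomposition idea in the abstract can be illustrated with a small NumPy sketch. This is not the code from the QUAD repository and may differ from the paper's exact construction of P. It assumes a single linear layer, synthetic activations with one artificial outlier channel, per-token and per-output-channel symmetric fake quantization, and a rank-1 outlier subspace. The helper names `quantize_sym`, `outlier_basis`, and `decomposed_linear` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def quantize_sym(x, bits=4, axis=-1):
    """Symmetric fake quantization: snap to a signed integer grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def outlier_basis(calib, rank):
    """Top-`rank` right singular vectors of calibration activations, shape (d, rank)."""
    _, _, vt = np.linalg.svd(calib, full_matrices=False)
    return vt[:rank].T

def decomposed_linear(x, w, v, bits=4):
    """Full-precision outlier path plus quantized residual path (illustrative only)."""
    x_out = x @ v                                   # outlier coordinates, kept in full precision
    x_res = x - x_out @ v.T                         # residual with the outlier directions removed
    w_out = v.T @ w                                 # small full-precision outlier weights
    y_out = x_out @ w_out
    y_res = quantize_sym(x_res, bits, axis=-1) @ quantize_sym(w, bits, axis=0)
    return y_out + y_res

rng = np.random.default_rng(0)
d, m = 64, 32

def make_acts(n):
    """Synthetic activations with one artificially large channel (an assumption)."""
    x = rng.standard_normal((n, d))
    x[:, 3] *= 50.0
    return x

calib, x = make_acts(256), make_acts(128)
w = rng.standard_normal((d, m))

v = outlier_basis(calib, rank=1)                    # estimated offline from calibration data
ref = x @ w                                         # full-precision reference output

plain = quantize_sym(x) @ quantize_sym(w, axis=0)   # direct 4-bit baseline
err_plain = np.linalg.norm(plain - ref) / np.linalg.norm(ref)
err_decomp = np.linalg.norm(decomposed_linear(x, w, v) - ref) / np.linalg.norm(ref)
print(f"relative error, direct 4-bit:     {err_plain:.4f}")
print(f"relative error, decomposed 4-bit: {err_decomp:.4f}")
```

In this toy setting the outlier channel inflates the per-token quantization scale of the direct baseline, so most ordinary channels round to zero. Removing the dominant singular direction first leaves a residual with a much smaller dynamic range. The full-precision `w_out` block plays the role of the adaptable outlier weights that the abstract describes as the target of parameter-efficient fine-tuning.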

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
