CL Jun 27, 2024

OutlierTune: Efficient Channel-Wise Quantization for Large Language Models

arXiv:2406.18832v1, 4 citations
Originality: Incremental advance
AI Analysis

This paper addresses the trade-off between hardware efficiency and accuracy in LLM activation quantization, offering AI practitioners a practical solution with gains in both speed and memory usage.

The paper tackles the difficulty of quantizing activations in large language models (LLMs), which is caused by structured outliers. It proposes OutlierTune, an efficient per-channel post-training quantization method that achieves accuracy comparable to half-precision (FP16) while running 1.48x faster and reducing memory usage by approximately 2x.

Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights with the activation scaling factors, avoiding the internal scaling and the costly additional computational overhead that per-channel activation quantization would otherwise introduce. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overhead during inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple tasks. Demonstrating better generalization, this framework improves the Int6 quantization of instruction-tuned LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we show that the proposed framework is 1.48x faster than the FP16 implementation while reducing memory usage by approximately 2x.
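The two components named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single linear layer `Y = X @ W + b`, takes symmetrization to mean centring each activation channel on the midpoint of its calibrated range, and takes pre-execution of dequantization to mean multiplying the per-channel scales into the rows of `W` ahead of time. The function names (`calibrate`, `fold_into_linear`, `quantize_activations`) are hypothetical, weights stay in floating point, and the division by the scale at quantization time is left explicit because the abstract does not say how it is absorbed.

```python
import numpy as np

def calibrate(X_calib, n_bits=8):
    """Per-channel shift and scale from calibration activations (tokens x channels)."""
    cmax = X_calib.max(axis=0)
    cmin = X_calib.min(axis=0)
    shift = (cmax + cmin) / 2.0                   # assumed symmetrization: centre each channel
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum((cmax - cmin) / 2.0 / qmax, 1e-8)  # guard against constant channels
    return shift, scale

def fold_into_linear(W, b, shift, scale):
    """Fold activation scales and shifts into a linear layer with W of shape (in, out).

    Since X ~= Xq * scale + shift, we have
    X @ W + b ~= Xq @ (diag(scale) @ W) + (b + shift @ W).
    """
    W_folded = scale[:, None] * W                 # dequantization done once, offline
    b_folded = b + shift @ W                      # constant offset absorbed into the bias
    return W_folded, b_folded

def quantize_activations(X, shift, scale, n_bits=8):
    """Per-channel quantization to integer-valued levels (stored as floats here)."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round((X - shift) / scale), -qmax, qmax)

# Toy data with one outlier channel, mimicking structured activation outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 50.0
W = rng.normal(size=(16, 8))
b = rng.normal(size=8)

shift, scale = calibrate(X)
W_folded, b_folded = fold_into_linear(W, b, shift, scale)
Xq = quantize_activations(X, shift, scale)

Y_ref = X @ W + b                                 # full-precision reference
Y_q = Xq @ W_folded + b_folded                    # no per-channel rescaling at run time
rel_err = np.linalg.norm(Y_q - Y_ref) / np.linalg.norm(Y_ref)
print(f"relative error: {rel_err:.4f}")
```

In this sketch the matrix multiply consumes the quantized activations directly, so no per-channel rescaling is needed inside or after the product. A real kernel would also quantize the folded weights and use integer arithmetic, which this example does not attempt.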

