LG AI CL | Apr 18, 2025

Gradual Binary Search and Dimension Expansion: A general method for activation quantization in LLMs

Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello

arXiv:2504.13989v2 | 1 citation | h-index: 2

Originality: Highly original

AI Analysis

This work addresses the efficient deployment of LLMs on resource-constrained devices, representing a strong specific gain rather than a foundational breakthrough.

The paper tackles the challenge of deploying large language models on edge devices by addressing outlier activations that hinder low-bit quantization, achieving 3-bit quantization for weights, activations, and KV caches with a 40% increase in accuracy on benchmarks compared to state-of-the-art methods.

Large language models (LLMs) have become pivotal in artificial intelligence, demonstrating strong capabilities in reasoning, understanding, and generating data. However, their deployment on edge devices is hindered by their substantial size, often reaching several billion parameters. Quantization is a widely used method to reduce memory usage and inference time; however, LLMs present unique challenges due to the prevalence of outliers in their activations. In this work, we leverage the theoretical advantages of Hadamard matrices over random rotation matrices to push the boundaries of quantization in LLMs. We demonstrate that Hadamard matrices are more effective in reducing outliers, which are a significant obstacle in achieving low-bit quantization. Our method, based on a gradual binary search, enables 3-bit quantization for weights, activations, and key-value (KV) caches, resulting in a 40% increase in accuracy on common benchmarks compared to SoTA methods. We extend the use of rotation matrices to support non-power-of-2 embedding dimensions, such as those of the Qwen architecture, by employing the Paley algorithm. We theoretically demonstrate the superiority of Hadamard matrices in reducing outliers. We achieve 3-bit quantization for weights, activations, and KV caches, significantly enhancing model performance. Our experimental results on multiple model families such as Mistral, LLaMA, and Qwen demonstrate the effectiveness of our approach, outperforming existing methods and enabling practical 3-bit quantization.
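The outlier argument in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sylvester_hadamard`, `paley_hadamard`, `rotate`, `unrotate`, `demo`) and the toy activation vector are hypothetical, the Paley routine covers only Paley construction I with a prime q congruent to 3 mod 4, and the gradual binary search itself is not shown. The sketch only checks that an orthonormal Hadamard rotation spreads a single large activation across all coordinates, which shrinks the step size that a symmetric 3-bit max-scaled quantizer would need.

```python
import math

def sylvester_hadamard(n):
    """n x n matrix of +1/-1 with H @ H.T == n * I, for n a power of 2."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    h = [[1]]
    while len(h) < n:
        # Sylvester doubling step: [[H, H], [H, -H]]
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    return h

def paley_hadamard(q):
    """(q+1) x (q+1) Hadamard matrix via Paley construction I.

    Simplifying assumption: q is a prime with q % 4 == 3
    (prime powers are not handled here).
    """
    if q < 3 or q % 4 != 3 or any(q % d == 0 for d in range(2, math.isqrt(q) + 1)):
        raise ValueError("q must be a prime congruent to 3 mod 4")

    def chi(a):
        # Quadratic character mod q via Euler's criterion: 0, +1 or -1.
        a %= q
        if a == 0:
            return 0
        return 1 if pow(a, (q - 1) // 2, q) == 1 else -1

    n = q + 1
    h = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or i == 0:
                h[i][j] = 1           # diagonal and first row
            elif j == 0:
                h[i][j] = -1          # first column below the diagonal
            else:
                h[i][j] = chi(i - j)  # Jacobsthal block
    return h

def rotate(x, h):
    """Apply the orthonormal rotation H / sqrt(n) to the vector x."""
    s = 1.0 / math.sqrt(len(h))
    return [s * sum(a * b for a, b in zip(row, x)) for row in h]

def unrotate(y, h):
    """Apply the inverse rotation H.T / sqrt(n)."""
    n = len(h)
    s = 1.0 / math.sqrt(n)
    return [s * sum(h[i][j] * y[i] for i in range(n)) for j in range(n)]

def demo(h, bits=3):
    """Toy activation vector with one outlier, before and after rotation."""
    n = len(h)
    x = [0.1 * ((i % 5) - 2) for i in range(n)]  # small values in [-0.2, 0.2]
    x[3] = 8.0                                   # a single large outlier
    y = rotate(x, h)
    qmax = 2 ** (bits - 1) - 1                   # largest positive level (3 for 3 bits)
    step_x = max(abs(v) for v in x) / qmax       # max-scaled quantization step, original
    step_y = max(abs(v) for v in y) / qmax       # max-scaled quantization step, rotated
    return x, y, step_x, step_y

for name, h in [("Sylvester, n=16", sylvester_hadamard(16)),
                ("Paley I, n=12", paley_hadamard(11))]:
    x, y, step_x, step_y = demo(h)
    print(f"{name}: max|x|={max(map(abs, x)):.3f}, max|Hx|={max(map(abs, y)):.3f}, "
          f"3-bit step {step_x:.3f} -> {step_y:.3f}")
```

Because the rotation is orthogonal, it can be undone exactly with `unrotate`, so in a rotated model it changes only the numerical range seen by the quantizer, not the function being computed. The order-12 Paley matrix stands in for the non-power-of-2 case; how the paper combines such blocks for real embedding dimensions is not specified in the abstract.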
