LG | Dec 5, 2024

SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

arXiv:2412.04180v2 | 10.44 citations | h-index: 104 | ICML

Originality: Incremental advance

AI Analysis

This work addresses the challenge of deploying LLMs efficiently for inference by enabling quantization at any bit width with improved performance. It is an incremental advance, since it builds on existing quantization techniques.

The paper tackles the problem of performance drops in quantized large language models by proposing SKIM, a method that uses scaled K-means clustering with mixed precision, which reduces the perplexity gap between 3-bit quantized and full-precision LLaMA models by 16.3% on average.

Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called SKIM: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A greedy algorithm to solve approximately optimal bit allocation across weight channels, and 2. A trainable scaling vector for non-differentiable K-means clustering. These techniques substantially improve performance and can be adapted to any given bit. Notably, in terms of model perplexity, our method narrows the gap between 3-bit quantized LLaMA models and their full precision counterparts by 16.3% on average.
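To make the two ideas in the abstract more concrete, here is a minimal sketch of K-means weight quantization combined with a greedy bit allocation across channels. It is not the SKIM algorithm itself: the abstract does not specify the objective, the greedy rule, or how the scaling vector is trained. The sketch assumes a plain squared-error objective, treats each row of the weight matrix as one channel, and omits the trainable scaling vector entirely. The names `kmeans_1d`, `channel_error`, and `greedy_bit_allocation` are hypothetical.

```python
import numpy as np


def kmeans_1d(values, n_clusters, n_iter=25):
    """Plain Lloyd's K-means on a 1-D array; returns (centroids, labels)."""
    # Initialise centroids at evenly spaced quantiles of the data.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iter):
        # Assign every value to its nearest centroid.
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # Move each non-empty centroid to the mean of its members.
        for k in range(n_clusters):
            members = values[labels == k]
            if members.size:
                centroids[k] = members.mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, labels


def channel_error(row, bits):
    """Squared reconstruction error of one channel quantized to 2**bits levels."""
    centroids, labels = kmeans_1d(row, 2 ** bits)
    return float(np.sum((row - centroids[labels]) ** 2))


def greedy_bit_allocation(weights, avg_bits, min_bits=1, max_bits=4):
    """Assign a bit width to each row so the average equals avg_bits.

    Assumed greedy rule (not taken from the paper): start every channel at
    min_bits, then repeatedly give one extra bit to the channel whose
    squared error drops the most.
    """
    n = weights.shape[0]
    levels = np.arange(min_bits, max_bits + 1)
    # err[c, j] is the error of channel c when quantized with levels[j] bits.
    err = np.array([[channel_error(row, b) for b in levels] for row in weights])
    idx = np.zeros(n, dtype=int)  # current position of each channel in levels
    budget = int(round(avg_bits * n)) - min_bits * n
    rows = np.arange(n)
    for _ in range(budget):
        can_grow = idx < len(levels) - 1
        if not can_grow.any():
            break
        nxt = np.minimum(idx + 1, len(levels) - 1)
        gain = np.where(can_grow, err[rows, idx] - err[rows, nxt], -np.inf)
        idx[int(np.argmax(gain))] += 1
    return levels[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy weight matrix whose rows have increasingly large magnitudes.
    w = rng.normal(size=(8, 64)) * np.linspace(0.1, 2.0, 8)[:, None]
    print(greedy_bit_allocation(w, avg_bits=3))
```

Because the budget is expressed as an average over channels, the same routine covers fractional targets such as 2.5 bits, which is one way to read the abstract's claim that the method can be adapted to any given bit level.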
