LG, CL | Dec 12, 2024

CRVQ: Channel-Relaxed Vector Quantization for Extreme Compression of LLMs

Tsinghua
arXiv:2412.09282v2 | 4 citations | h-index: 9
Originality: Highly original
AI Analysis

The work addresses the need for efficient LLM deployment on limited hardware and represents a strong incremental advance in extreme compression techniques.

The paper tackles the compression of large language models (LLMs) for deployment on resource-constrained devices. It proposes Channel-Relaxed Vector Quantization (CRVQ), which achieves a 38.9% improvement over the strongest sub-2-bit post-training quantization baseline and brings 1-bit compression closer to lossless.

Powerful large language models (LLMs) are increasingly expected to be deployed at lower computational cost, bringing their capabilities to resource-constrained devices. Post-training quantization (PTQ) has emerged as a leading approach to this goal, with the best methods compressing weights to fewer than 2 bits on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only minimal additional bits. This state-of-the-art extreme compression method achieves its results through two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging extended codebooks to relax the constraint on critical channels. With our method, we demonstrate a 38.9% improvement over the current strongest sub-2-bit PTQ baseline, bringing 1-bit compression closer to lossless. Furthermore, our approach offers flexible customization of quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.
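The two ideas in the abstract can be illustrated with a small bookkeeping sketch. It is not the authors' implementation. The importance scores, the function names, and the bit-accounting formula are all assumptions made for illustration: the abstract does not say how channels are scored, and the formula ignores codebook storage overhead. The sketch only shows how picking a small fraction of channels, moving them to the front, and giving them extra codebooks would change the average bit-width.

```python
def select_critical_channels(importance, fraction):
    """Return indices of the highest-scoring channels (at least one).

    `importance` is a hypothetical per-channel score; the scoring rule
    used in the paper is not given in the abstract.
    """
    count = max(1, round(fraction * len(importance)))
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return ranked[:count]


def reorder_permutation(num_channels, critical):
    """Permutation that places critical channels first.

    The remaining channels keep their original relative order.
    """
    chosen = set(critical)
    rest = [i for i in range(num_channels) if i not in chosen]
    return list(critical) + rest


def average_bits(index_bits, vector_dim, critical_fraction, num_extended):
    """Average index bits per weight under an assumed accounting.

    Every weight pays for one base codebook index (index_bits shared by
    vector_dim weights); weights in critical channels additionally pay
    for num_extended extra indices. Codebook storage is ignored.
    """
    base = index_bits / vector_dim
    return base * (1 + critical_fraction * num_extended)


# Illustrative values only, not taken from the paper.
importance = [0.1, 0.9, 0.2, 0.05, 0.7, 0.3, 0.15, 0.25]
critical = select_critical_channels(importance, fraction=0.25)  # [1, 4]
perm = reorder_permutation(len(importance), critical)
bits = average_bits(index_bits=8, vector_dim=8,
                    critical_fraction=0.02, num_extended=4)
```

Under these assumed settings, a base rate of 1 bit per weight rises by only 0.08 bits when 2% of channels receive four extra codebooks. This is one way to read the abstract's claim of "minimal additional bits" and of a tunable bit-width.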

