LG | AI | May 23, 2025

NSNQuant: A Double Normalization Approach for Calibration-Free Low-Bit Vector Quantization of KV Cache

arXiv:2505.18231v1 | h-index: 1
Originality: Incremental advance
AI Analysis

NSNQuant targets the high memory usage of LLM inference at large batch sizes and long sequence lengths. Its calibration-free approach is an incremental advance over existing vector quantization methods.

The paper addresses the memory cost of Large Language Model inference by introducing NSNQuant, a calibration-free vector quantization technique for low-bit compression of the KV cache. It reports better results than prior methods and up to 3x throughput gain over full-precision baselines.

Large Language Model (LLM) inference is typically memory-intensive, especially when processing large batch sizes and long sequences, because of the size of the key-value (KV) cache. Vector Quantization (VQ) has recently been adopted to alleviate this issue, but we find that the existing approach is susceptible to distribution shift because it relies on calibration datasets. To address this limitation, we introduce NSNQuant, a calibration-free VQ technique designed for low-bit compression of the KV cache. NSNQuant applies a three-step transformation together with a Hadamard transform: 1) a token-wise normalization (Normalize), 2) a channel-wise centering (Shift), and 3) a second token-wise normalization (Normalize). This effectively aligns the token distribution with the standard normal distribution. The alignment enables robust, calibration-free vector quantization using a single reusable codebook. Extensive experiments show that NSNQuant consistently outperforms prior methods in both 1-bit and 2-bit settings, offering strong generalization and up to 3x throughput gain over full-precision baselines.
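The Normalize-Shift-Normalize steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`hadamard`, `nsn_transform`, `nsn_inverse`), the choice to apply the Hadamard rotation before the first normalization, the target norm of sqrt(d), and the `eps` term are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)


def nsn_transform(x, eps=1e-8):
    """Hypothetical Normalize-Shift-Normalize on x of shape (tokens, channels)."""
    d = x.shape[1]
    # Assumption: rotate channels first to spread out channel-wise outliers.
    x = x @ hadamard(d)
    # Normalize: rescale each token (row) to norm of roughly sqrt(d).
    s1 = np.linalg.norm(x, axis=1, keepdims=True) / np.sqrt(d) + eps
    x = x / s1
    # Shift: subtract the per-channel (column) mean.
    mu = x.mean(axis=0, keepdims=True)
    x = x - mu
    # Normalize again, since centering changed the token norms.
    s2 = np.linalg.norm(x, axis=1, keepdims=True) / np.sqrt(d) + eps
    x = x / s2
    return x, (s1, mu, s2)


def nsn_inverse(y, s1, mu, s2):
    """Undo nsn_transform using the stored scales and channel means."""
    d = y.shape[1]
    x = (y * s2 + mu) * s1
    # The Hadamard matrix is orthonormal, so its transpose is its inverse.
    return x @ hadamard(d).T
```

After the transform, every token has approximately the same norm, so a vector quantizer would only need to encode direction-like information, while the per-token scales and per-channel means are stored separately for reconstruction. How NSNQuant stores these statistics and builds its codebook is not covered by this sketch.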

