CL · Oct 13, 2025

XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao

arXiv:2510.11236v1 · 3 citations · h-index: 24 · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses memory-efficiency challenges in deploying LLMs in resource-constrained environments. It represents an incremental improvement over existing quantization methods.

The paper tackles the high memory demands of Large Language Models (LLMs) caused by KV cache growth. It proposes XQuant, a training-free framework for ultra-low bit-width KV cache quantization that reaches an equivalent bit-width below 1.4 bits. On TruthfulQA and LongBench, XQuant is reported to outperform state-of-the-art methods such as KIVI-2bit and AsymKV-1.5bit while using fewer bits.

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.
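To make the notion of low-bit KV cache quantization and "equivalent bit-width" concrete, here is a minimal sketch of generic group-wise asymmetric uniform quantization. It is not XQuant's calibration method or its cross-layer compression, neither of which is detailed in the abstract. The function names (`quantize_group`, `dequantize_group`, `equivalent_bits`) and the 16-bit storage assumed for each scale and zero-point are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric uniform quantization of one group of values to `bits`-bit codes.

    Hypothetical helper: returns (codes, scale, zero_point).
    """
    levels = (1 << bits) - 1                      # largest code, e.g. 3 for 2-bit
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard against constant groups
    # Map each value to the nearest code and clamp to the valid range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, zero_point: float) -> List[float]:
    """Reconstruct approximate values from integer codes."""
    return [c * scale + zero_point for c in codes]


def equivalent_bits(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Average stored bits per element, counting one scale and one zero-point
    per group (assumed here to take `meta_bits` bits each)."""
    return bits + 2 * meta_bits / group_size


if __name__ == "__main__":
    group = [-1.0, -0.5, 0.0, 0.5, 1.0, 2.0]      # toy slice of a key/value tensor
    codes, scale, zero = quantize_group(group, bits=2)
    restored = dequantize_group(codes, scale, zero)
    print(codes, restored)
    print(equivalent_bits(bits=2, group_size=32))  # 2 + 32/32 = 3.0
```

In this sketch the per-group metadata raises the average cost above the nominal bit-width, so any scheme that reaches an equivalent bit-width below the nominal one has to save storage elsewhere; the abstract attributes XQuant's savings to cross-layer compression.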
