LG · Oct 6, 2025

KVLinC: KV Cache Quantization with Hadamard Rotation and Linear Correction

arXiv:2510.05373v1 · 4 citations · h-index: 10

Originality: Incremental advance

AI Analysis

The work targets the memory and latency cost of long-context LLM inference, which matters to users who need faster models with a smaller KV-cache footprint. It is an incremental advance because it builds on existing KV-cache quantization strategies.

The paper tackles the significant errors that arise in large language model inference when the key-value cache is aggressively quantized to very low precision. It proposes KVLinC to mitigate these errors, matching or surpassing strong baselines while achieving up to 2.55x faster inference than a Flash Attention baseline.

Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that results in up to 2.55x faster inference compared to the Flash Attention baseline, enabling efficient long-context LLM inference.
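
To make the Hadamard-rotation idea concrete, here is a minimal pure-Python sketch. It is not the paper's kernel or its exact quantization scheme. The per-vector asymmetric 2-bit quantizer, the helper names (`hadamard_rotate`, `quantize_2bit`, `dequantize`, `sq_error`), and the toy vector with a single outlier are all assumptions chosen for illustration. The sketch shows only the general effect: an orthonormal Hadamard rotation spreads an outlier across all coordinates, so a coarse uniform grid covers the rotated vector more tightly.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).
    The normalized transform is its own inverse."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    s = math.sqrt(n)
    return [v / s for v in y]

def quantize_2bit(x):
    """Hypothetical per-vector asymmetric quantizer with 4 levels (codes 0..3)."""
    lo, hi = min(x), max(x)
    scale = (hi - lo) / 3 or 1.0  # avoid a zero scale for constant vectors
    codes = [min(3, max(0, round((v - lo) / scale))) for v in x]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Toy "value" vector: one large outlier among small entries (assumed data).
x = [8.0, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.0]

# Baseline: quantize the raw vector directly.
plain = dequantize(*quantize_2bit(x))
err_plain = sq_error(x, plain)

# Rotated: quantize in the Hadamard domain, then rotate back.
rotated = hadamard_rotate(x)
restored = hadamard_rotate(dequantize(*quantize_2bit(rotated)))
err_rot = sq_error(x, restored)

print(f"squared error without rotation: {err_plain:.4f}")
print(f"squared error with rotation:    {err_rot:.4f}")
```

In this toy case the outlier forces the unrotated grid to span a wide range, so every small entry collapses onto the same level, while the rotated vector has a narrow range that four levels cover well. Real value tensors will not behave this cleanly, and the sketch does not model the linear correction adapters that the abstract describes for quantized keys.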
