LG · Dec 5, 2025

KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

arXiv:2512.05916v1 · 3 citations
Originality: Incremental advance
AI Analysis

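This work addresses the KV-cache memory bottleneck in large language model inference. It appears incremental: it builds on prior low-rank compression methods, replacing decompositions of keys alone (or joint query-key embeddings) with a decomposition that targets the attention matrix directly.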

The paper tackles the memory bottleneck of the Key-Value (KV) cache in transformer-based LLMs by introducing KQ-SVD, a method that directly compresses the attention matrix with a closed-form low-rank decomposition. Evaluations on LLaMA and Mistral models show that it preserves attention outputs with higher fidelity under compression.

The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.
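The abstract's central claim, that compressing keys alone is suboptimal for approximating the query-key inner products, can be illustrated with a small NumPy sketch. This is not the paper's algorithm. It assumes the "attention matrix" means the pre-softmax score matrix Q Kᵀ, uses random anisotropic data, and forms the full T×T matrix for clarity. The function names (`k_only_approx`, `kq_approx`) are hypothetical. It compares a rank-r projection of the keys with the truncated SVD of Q Kᵀ, which by the Eckart-Young theorem is the best rank-r approximation in Frobenius norm.

```python
import numpy as np

def truncated_svd(M, r):
    """Best rank-r approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def k_only_approx(Q, K, r):
    """Project keys onto their own top-r right singular subspace,
    ignoring the queries, then form the scores."""
    _, _, Vt = np.linalg.svd(K, full_matrices=False)
    P = Vt[:r].T @ Vt[:r]          # d x d orthogonal projector of rank r
    return Q @ (K @ P).T

def kq_approx(Q, K, r):
    """Rank-r approximation of the score matrix Q K^T itself."""
    return truncated_svd(Q @ K.T, r)

# Illustrative data (an assumption): queries and keys whose high-variance
# directions differ, so the key subspace is a poor guide to the scores.
rng = np.random.default_rng(0)
T, d, r = 64, 16, 4
Q = rng.standard_normal((T, d)) * np.linspace(1.0, 0.1, d)
K = rng.standard_normal((T, d)) * np.linspace(0.1, 1.0, d)

M = Q @ K.T
err_k = np.linalg.norm(M - k_only_approx(Q, K, r))    # keys-only error
err_kq = np.linalg.norm(M - kq_approx(Q, K, r))       # score-matrix error
print(f"keys-only error: {err_k:.4f}, score-matrix error: {err_kq:.4f}")
```

Both approximations have rank at most r, so the score-matrix error can never exceed the keys-only error. How large the gap is depends on how differently the queries and keys are distributed.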

