LG · Jan 29

Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello

arXiv:2601.21686v1 · 1.4 · h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses memory efficiency for large language models during inference, though it is an incremental improvement over existing compression methods.

The paper tackles the KV-cache memory and bandwidth bottleneck in long-context autoregressive decoding by introducing StiefAttention, a post-training compression method that learns orthonormal projection bases to minimize decoder-layer output reconstruction error. On Llama3-8B, it improves C4 perplexity by 11.9 points and 0-shot MMLU accuracy by 5.4% over EigenAttention at the same compression level.

Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank and storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. We therefore introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by 11.9 points on C4 perplexity and 5.4% on 0-shot MMLU accuracy at iso-compression, while yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
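To make the low-rank projection and the error-rank profile concrete, here is a minimal NumPy sketch. It is not StiefAttention: the basis comes from a plain SVD of the cached keys (the kind of proxy objective the abstract contrasts against), and the error is measured on the keys themselves rather than on decoder-layer outputs. All function names, the synthetic data, and the 0.05 budget are assumptions made for illustration.

```python
import numpy as np

def svd_basis(X, rank):
    """Orthonormal basis (head_dim x rank) from the top right singular vectors of X.

    Assumption: this is an SVD-style proxy fit, not the learned Stiefel-manifold
    basis described in the abstract.
    """
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:rank].T

def compress(X, basis):
    """Project per-head cache entries (tokens x head_dim) down to (tokens x rank)."""
    return X @ basis

def reconstruct(Z, basis):
    """Map compressed entries back to head_dim using the transpose of the basis."""
    return Z @ basis.T

def relative_error(X, X_hat):
    """Frobenius-norm relative reconstruction error."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

def error_rank_profile(X, candidate_ranks):
    """Reconstruction error for each candidate rank (hypothetical stand-in for
    the per-layer profile, which the paper computes on decoder-layer outputs)."""
    profile = {}
    for r in candidate_ranks:
        basis = svd_basis(X, r)
        profile[r] = relative_error(X, reconstruct(compress(X, basis), basis))
    return profile

def smallest_rank_within_budget(profile, budget):
    """Pick the smallest rank whose error fits the budget; fall back to the largest rank."""
    feasible = [r for r in sorted(profile) if profile[r] <= budget]
    return feasible[0] if feasible else max(profile)

# Synthetic keys for one head: a rank-8 signal plus small noise (illustrative only).
rng = np.random.default_rng(0)
tokens, head_dim = 256, 64
K = rng.standard_normal((tokens, 8)) @ rng.standard_normal((8, head_dim))
K += 0.01 * rng.standard_normal((tokens, head_dim))

candidate_ranks = [4, 8, 16, 32, 64]
profile = error_rank_profile(K, candidate_ranks)
chosen_rank = smallest_rank_within_budget(profile, budget=0.05)
```

With a per-layer profile like this in hand, rank allocation reduces to a lookup per layer. The method described above differs in what it optimizes: the basis is learned to minimize error after softmax, value mixing, and the rest of the decoder layer, which a plain SVD of the cache does not account for.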
