LG · AI · Oct 25, 2025

Efficient Low Rank Attention for Long-Context Inference in Large Language Models

arXiv:2510.23649v1 · 2 citations · h-index: 39 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the memory constraints of deploying LLMs on resource-constrained devices, offering an incremental improvement over existing methods such as quantization and pruning.

The paper tackles the prohibitive GPU memory costs of key-value caches in large language models for long-context inference by introducing LRQK, a two-stage framework that decomposes query and key matrices into low-rank factors, achieving significant memory savings with minimal accuracy loss on benchmarks like RULER and LongBench.

As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource-constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. We introduce Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-\(r\) factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in \(\mathcal{O}(lr)\) time at each decode step. By selecting only the top-\(k\) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long-context settings, while delivering significant memory savings with minimal loss in accuracy. Our code is available at https://github.com/tenghuilee/LRQK.
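
The decode-step logic described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function and class names (`proxy_scores`, `select_tokens`, `MixedCache`) are hypothetical, the rank-\(r\) factors are assumed to have been produced already during prefill, the "GPU" and "CPU" caches are stood in for by plain dictionaries, and the eviction rule (drop the oldest entries not needed at the current step) is an assumption, since the abstract does not specify one.

```python
import heapq


def proxy_scores(q_low, k_low):
    """Approximate attention logits from rank-r factors.

    q_low: list of r floats (low-rank query for the current decode step).
    k_low: list of l lists of r floats (low-rank keys, one per cached token).
    Cost is O(l * r), matching the complexity stated in the abstract.
    """
    return [sum(a * b for a, b in zip(q_low, k)) for k in k_low]


def select_tokens(scores, top_k, n_recent):
    """Pick the top_k highest-scoring older tokens plus the n_recent newest."""
    l = len(scores)
    recent = set(range(max(0, l - n_recent), l))
    older = [i for i in range(l) if i not in recent]
    top = heapq.nlargest(top_k, older, key=scores.__getitem__)
    return sorted(recent | set(top))


class MixedCache:
    """Toy hit-and-miss cache: full KV store on 'CPU', a bounded subset on 'GPU'."""

    def __init__(self, cpu_kv, capacity):
        self.cpu_kv = cpu_kv      # token index -> full-precision (key, value)
        self.capacity = capacity  # max entries kept on the 'GPU' side
        self.gpu_kv = {}          # insertion-ordered dict of resident entries
        self.transfers = 0        # number of CPU -> GPU copies so far

    def fetch(self, indices):
        wanted = set(indices)
        for i in indices:
            if i not in self.gpu_kv:          # miss: copy only this pair
                self.gpu_kv[i] = self.cpu_kv[i]
                self.transfers += 1
        # Assumed eviction policy: drop oldest entries not needed this step.
        for i in list(self.gpu_kv):
            if len(self.gpu_kv) <= self.capacity:
                break
            if i not in wanted:
                del self.gpu_kv[i]
        return [self.gpu_kv[i] for i in indices]


# Six cached tokens with rank-2 key factors, and one rank-2 query.
k_low = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.0], [0.0, 0.1], [0.1, 0.0]]
q_low = [1.0, 0.0]
scores = proxy_scores(q_low, k_low)
chosen = select_tokens(scores, top_k=2, n_recent=2)

cache = MixedCache({i: (f"k{i}", f"v{i}") for i in range(6)}, capacity=4)
first = cache.fetch(chosen)          # cold cache: every selected pair is a miss
second = cache.fetch([0, 2, 4, 5])   # only token 2 is missing and gets copied
```

Because the selected KV pairs are fetched at full precision, exact attention can then be computed over that subset; the low-rank factors are used only to decide which tokens to fetch.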

