AI | CL | Feb 2

More Than a Quick Glance: Overcoming the Greedy Bias in KV-Cache Compression

arXiv:2602.02199v1 | 1 citation
AI Analysis

This work addresses memory efficiency when deploying LLMs with long contexts, but the contribution is incremental, since it builds on existing compression strategies.

The paper tackles KV-cache compression in Large Language Models, where cache memory growth constrains deployment. It introduces LASER-KV, a framework that maintains stable performance and reaches up to 10% higher accuracy at 128k context length, whereas previous methods degrade by 15-30%.

While Large Language Models (LLMs) can theoretically support extensive context windows, their actual deployment is constrained by the linear growth of Key-Value (KV) cache memory. Prevailing compression strategies mitigate this through various pruning mechanisms, yet they trade off semantic recall for memory efficiency. In this work, we present LASER-KV (Layer Accumulated Selection with Exact-LSH Recall), a framework designed to test the limits of KV compression under a strict accumulative budgeting policy. We deviate from the standard approach of a fixed summary size by implementing a block-wise accumulation strategy governed by a protection divisor (n). This allows us to isolate the effects of compression from sliding-window artifacts. Our experiments on the Babilong benchmark reveal that previous compression methods degrade by 15-30% on various long-context tasks. LASER-KV maintains stable performance, achieving superior accuracy by a margin of up to 10% at 128k. These findings challenge the prevailing assumption that attention scores alone are a sufficient proxy for token utility.
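To make the linear growth of KV-cache memory concrete, here is a minimal sketch that estimates the size of an uncompressed cache. It is not taken from the paper: the function name `kv_cache_bytes` and the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit elements) are hypothetical values chosen for illustration, and 128k is assumed to mean 128 × 1024 tokens.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold keys and values for `seq_len` cached tokens."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, for illustration only (not from the paper).
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **cfg) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
# Prints 1.0 GiB, 4.0 GiB and 16.0 GiB respectively.
```

Under these assumptions each cached token costs 128 KiB, so the cache grows from 1 GiB at 8k tokens to 16 GiB at 128k tokens for a single sequence. That growth is the pressure that pruning-based compression tries to relieve.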

