CL | Nov 8, 2024

RefreshKV: Updating Small KV Cache During Long-form Generation

arXiv:2411.05787v26.68 citations | h-index: 8 | Has Code | ACL

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in efficient long-form generation for LLM users, though it is an incremental improvement over existing KV compression methods.

The paper tackles the performance degradation that occurs in long-form generation when LLM inference is sped up with a compressed KV cache. It proposes RefreshKV, which alternates between full and compressed attention to update the cache during generation. The method achieves speedup comparable to existing methods while improving performance on various tasks.

Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is to construct a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks. Lastly, we show that continued pretraining with our inference setting brings further gains in performance.
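
The alternation described in the abstract can be illustrated with a toy, single-head sketch in plain Python. It is not the authors' implementation. The fixed `stride` between full-attention steps, the top-k selection rule, the initial cache of the most recent tokens, and all function names are assumptions made for illustration; the abstract only says that the method alternates "flexibly" and that the small cache is updated from the attention pattern over the entire input. KV entries for newly generated tokens are ignored to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Single-head scaled dot-product attention restricted to `positions`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, keys[p])) / scale for p in positions]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * values[p][d] for w, p in zip(weights, positions)) for d in range(dim)]
    return output, weights


def top_k_positions(weights, positions, k):
    """Return the k positions with the largest attention weight, in input order."""
    ranked = sorted(zip(weights, positions), key=lambda wp: -wp[0])
    return sorted(p for _, p in ranked[:k])


def refresh_decode(queries, keys, values, small_size, stride):
    """Alternate full and partial attention, refreshing the small cache.

    Assumption: a full-attention step happens every `stride` steps, and the
    small cache is rebuilt from the top-`small_size` attention weights.
    """
    all_positions = list(range(len(keys)))
    small = all_positions[-small_size:]  # assumed initial cache: most recent tokens
    outputs, trace = [], []
    for step, query in enumerate(queries):
        if step % stride == 0:
            # Full step: attend to every input token, then refresh the small cache.
            out, weights = attend(query, keys, values, all_positions)
            small = top_k_positions(weights, all_positions, small_size)
            trace.append(("full", list(small)))
        else:
            # Partial step: attend only to the tokens kept in the small cache.
            out, _ = attend(query, keys, values, small)
            trace.append(("partial", list(small)))
        outputs.append(out)
    return outputs, trace
```

In this sketch, the cost of a partial step scales with `small_size` rather than with the full input length, while each full step lets tokens that were dropped earlier re-enter the small cache if the current query attends to them.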
