CL · LG · Oct 15, 2024

In-context KV-Cache Eviction for LLMs via Attention-Gate

arXiv:2410.12876v3 · 10 citations · h-index: 10
AI Analysis

This paper addresses efficiency issues in LLM inference systems, though it is an incremental improvement over existing KV-Cache techniques.

The paper tackles the KV-Cache bottleneck in LLM inference by introducing a dynamic eviction policy using Attention-Gates, which improves efficiency and performance by caching only a subset of tokens.

The KV-Cache technique has become the standard for large language model (LLM) inference. Yet KV-Cache is widely criticized as a potential bottleneck of the LLM inference system. This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model. The Attention-Gate accepts the global context as input and yields eviction flags for each token. The self-attention modules in the model proceed according to the flags and cache only a subset of the KV states for next-token prediction. The Attention-Gates can yield different flags for different heads and layers and can be easily tuned on top of a pre-trained LLM via continual pre-training or supervised fine-tuning. The computational and memory overhead introduced by Attention-Gates can be minimal. We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens can not only improve efficiency but also enhance performance.
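
The flow described in the abstract (a gate reads the whole context, emits a keep-or-evict flag per token and per head, and attention then runs over the surviving KV states only) can be sketched in a few lines of plain Python. This is an illustrative toy, not the paper's implementation: the gate form (a per-head linear score on each token's state plus the sequence mean, passed through a sigmoid), the 0.5 threshold, and the names `attention_gate`, `evict`, and `attend` are all assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def attention_gate(hidden, gate_weights, threshold=0.5):
    """Return keep flags indexed as flags[head][token].

    Hypothetical gate: each token is scored from its own state plus the
    mean state of the whole sequence (a stand-in for "global context").
    """
    n, dim = len(hidden), len(hidden[0])
    context = [sum(h[d] for h in hidden) / n for d in range(dim)]
    flags = []
    for w in gate_weights:  # one weight vector per attention head
        head_flags = []
        for h in hidden:
            feature = [hd + cd for hd, cd in zip(h, context)]
            head_flags.append(sigmoid(dot(w, feature)) > threshold)
        flags.append(head_flags)
    return flags

def evict(keys, values, keep):
    """Keep only the KV pairs whose flag is True."""
    kept = [i for i, flag in enumerate(keep) if flag]
    return [keys[i] for i in kept], [values[i] for i in kept], kept

def attend(query, keys, values):
    """Scaled dot-product attention over the (already pruned) cache."""
    scores = [dot(query, k) / math.sqrt(len(query)) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]

# Toy run: 4 tokens, hidden size 2, 2 heads. Keys and values reuse the
# hidden states to keep the example short.
hidden = [[1.0, 0.0], [-2.0, 0.0], [0.5, 0.5], [-1.0, -1.0]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

flags = attention_gate(hidden, gate_weights)
keys0, values0, kept0 = evict(hidden, hidden, flags[0])
output0 = attend([1.0, 0.0], keys0, values0)
```

In this toy run the two heads retain different token subsets, which mirrors the abstract's point that flags can vary across heads and layers. A practical system would also need to guarantee that every head keeps at least one token, since `attend` assumes a non-empty cache.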

