LG | AI | CL | Oct 24, 2024

KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

arXiv:2410.18517v1 | 31 citations | h-index: 20
Originality: Incremental advance
AI Analysis

This paper addresses memory efficiency in LLM inference with a plug-and-play method. The advance is incremental, but the layer-wise approach and its compatibility with existing intra-layer compression methods are novel.

The paper tackles the problem of high GPU memory consumption during large language model inference by proposing KVSharer, a method that shares KV caches between layers to reduce memory usage by 30% and achieve at least 1.3 times generation acceleration without significantly impacting performance.
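To make the 30% figure concrete, the following is a back-of-the-envelope sketch of KV cache size and of the saving when some layers reuse another layer's cache instead of storing their own. The model shape (40 layers, 40 KV heads, head dimension 128, 4096 tokens, fp16) is a hypothetical configuration chosen for illustration, not one taken from the paper, and the function names are made up for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to store keys and values for every layer.

    The leading factor 2 accounts for the two tensors (K and V) per layer.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def shared_kv_cache_bytes(num_layers, num_shared_layers, **shape):
    """Cache size when `num_shared_layers` layers reuse another layer's cache.

    Assumption: a layer that shares stores nothing of its own, so only the
    remaining layers contribute to memory.
    """
    return kv_cache_bytes(num_layers - num_shared_layers, **shape)


# Hypothetical model shape, for illustration only.
shape = dict(num_kv_heads=40, head_dim=128, seq_len=4096)
full = kv_cache_bytes(40, **shape)
shared = shared_kv_cache_bytes(40, 12, **shape)  # 12 of 40 layers = 30%
saving = 1 - shared / full
```

Under these assumptions the saving scales linearly with the fraction of layers that share, which is why sharing 30% of the layers removes 30% of the KV cache.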

The development of large language models (LLMs) has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The key and value storage of the attention map in the KV (key-value) cache accounts for more than 80% of this memory consumption. Nowadays, most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer but few works consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves the model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption without significantly impacting model performance and it can also achieve at least 1.3 times generation acceleration. Additionally, we verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
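The idea of preferring dissimilar layers when choosing which KV caches to share can be sketched as a greedy pairing over per-layer cache summaries. This is an illustrative sketch only, not the authors' procedure: the abstract does not say how caches are summarized, which distance is used, or how candidate pairs are validated, so the Euclidean distance, the `plan_sharing` function, and the conflict rules below are all assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def plan_sharing(layer_summaries, target_ratio):
    """Greedily pick layer pairs to share, most dissimilar first.

    layer_summaries: one flat vector per layer summarizing its KV cache
        (hypothetical input; e.g. a cache averaged over calibration data).
    target_ratio: fraction of layers whose cache should be replaced.
    Returns {replaced_layer: source_layer}.
    """
    n = len(layer_summaries)
    budget = int(n * target_ratio)
    # Rank all layer pairs from most to least dissimilar.
    pairs = sorted(
        combinations(range(n), 2),
        key=lambda p: euclidean(layer_summaries[p[0]], layer_summaries[p[1]]),
        reverse=True,
    )
    share_map = {}
    for src, dst in pairs:
        if len(share_map) >= budget:
            break
        # Skip pairs that would chain shares: a replaced layer cannot be
        # replaced again or act as a source, and a source cannot be replaced.
        if dst in share_map or src in share_map or dst in share_map.values():
            continue
        share_map[dst] = src  # layer `dst` reuses layer `src`'s KV cache
    return share_map


# Toy summaries for a 4-layer model; values are made up.
summaries = [[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [10.0, 10.0]]
plan = plan_sharing(summaries, target_ratio=0.5)
```

A practical implementation would also need to check that the model's outputs stay close to the original after each share is applied; that validation step is omitted here.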

Code Implementations: 1 repo