EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
EchoKV addresses memory efficiency for LLM deployment in long-context scenarios, offering a flexible solution with incremental improvements over prior compression methods.
The paper tackles the memory bottleneck of the Key-Value (KV) cache in Large Language Models for long-context applications. It proposes EchoKV, a flexible compression scheme that enables on-demand transitions between standard and compressed inference, and reports consistent performance gains over existing methods across various compression ratios on the LongBench and RULER benchmarks.
The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.
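To make the reconstruction idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that only a subset of attention heads' keys is cached and that the remaining heads are rebuilt from that subset. The abstract describes a lightweight network trained with a two-stage fine-tuning strategy; this sketch swaps in a plain least-squares linear map fitted on synthetic data, purely for illustration. All names (`fit_reconstructor`, `reconstruct`, the shapes, and the synthetic similarity structure) are hypothetical.

```python
import numpy as np

def fit_reconstructor(kept, dropped):
    """Fit a linear map W so that kept @ W approximates dropped.

    Stand-in for the paper's lightweight network (an assumption of this sketch).
    """
    W, *_ = np.linalg.lstsq(kept, dropped, rcond=None)
    return W

def reconstruct(kept, W):
    """Rebuild the dropped KV components from the cached subset."""
    return kept @ W

rng = np.random.default_rng(0)
seq_len, head_dim, n_heads, n_kept = 256, 16, 8, 4

# Synthetic keys: every head is a linear mix of one shared latent signal plus
# small noise. This mimics the similarity among heads that the abstract relies
# on; real KV caches will not be this clean.
latent = rng.standard_normal((seq_len, 32))
mix = rng.standard_normal((32, n_heads * head_dim))
keys = latent @ mix + 0.01 * rng.standard_normal((seq_len, n_heads * head_dim))

kept = keys[:, : n_kept * head_dim]      # heads that stay in the cache
dropped = keys[:, n_kept * head_dim :]   # heads that are reconstructed on demand

# Fit on the first 192 tokens, evaluate on the held-out last 64 tokens.
W = fit_reconstructor(kept[:192], dropped[:192])
approx = reconstruct(kept[192:], W)

rel_error = np.linalg.norm(approx - dropped[192:]) / np.linalg.norm(dropped[192:])
stored_fraction = kept.shape[1] / keys.shape[1]  # share of the cache kept in memory

print(f"stored fraction: {stored_fraction}, relative error: {rel_error:.4f}")
```

Because the full keys are never transformed, only partially dropped, a scheme of this shape can fall back to standard inference by simply keeping all heads in the cache, which is the kind of flexibility the abstract emphasizes.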