SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
This work addresses efficient inference for LLMs, offering a domain-specific incremental improvement in KV cache compression.
The paper tackles the KV cache storage pressure that growing input sequence lengths place on LLMs. It proposes SurfaceLogicKV, a two-stage compression method based on attention behaviors, which achieves improved compression robustness and competitive performance across tasks and long sequences.
The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly separating attention behavior into two categories that we define, surface memorization and logic construction, reveals their essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage method, SurfaceLogicKV, that uses these attention behaviors for KV cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared with baselines, and in some specific situations even compared with FullKV.
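To make the storage pressure concrete, the following minimal sketch estimates the size of an uncompressed KV cache as a function of sequence length. It uses the standard accounting of one key and one value vector per token, per layer, per KV head. The function name `kv_cache_bytes` and the model configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the uncompressed KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to 16-bit floats (an assumption).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model configuration, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions the cache grows linearly with sequence length, at 0.5 MiB per token, which is why head- and layer-wise compression becomes attractive for long inputs.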