CL · Jun 13, 2025

Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache

Xiaoran Liu, Siyang He, Qiqi Wang, Ruixiao Li, Yuerong Song, Zhigeng Liu, Linlin Li, Qun Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

arXiv:2506.11886v1 · 12.05 citations · h-index: 14

Originality: Incremental advance

AI Analysis

The paper addresses memory efficiency for LLM deployment, particularly in long-context applications, though it is an incremental improvement over existing KV cache compression methods.

The paper tackles the memory demands of Key-Value (KV) caches in Large Language Models by proposing FourierAttention, a training-free framework that approximates the long-context-insensitive head dimensions using Fourier bases. In evaluations on LLaMA models, the authors report the best long-context accuracy on the LongBench and Needle-In-A-Haystack (NIAH) benchmarks.

Large Language Models struggle with memory demands from the growing Key-Value (KV) cache as context lengths increase. Existing compression methods homogenize head dimensions or rely on attention-guided token pruning, often sacrificing accuracy or introducing computational overhead. We propose FourierAttention, a training-free framework that exploits the heterogeneous roles of transformer head dimensions: lower dimensions prioritize local context, while upper ones capture long-range dependencies. By projecting the long-context-insensitive dimensions onto orthogonal Fourier bases, FourierAttention approximates their temporal evolution with fixed-length spectral coefficients. Evaluations on LLaMA models show that FourierAttention achieves the best long-context accuracy on LongBench and Needle-In-A-Haystack (NIAH). In addition, a custom Triton kernel, FlashFourierAttention, is designed to optimize memory via streamlined read-write operations, enabling efficient deployment without compromising performance.
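
The projection step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation or the FlashFourierAttention kernel: it assumes a real-valued cache slice of shape (seq_len, n_dims), uses the standard discrete Fourier basis via `numpy.fft.rfft`, and simply keeps the lowest `num_coeffs` frequencies. The function names `compress_dims` and `reconstruct_dims`, and the choice of which frequencies to keep, are hypothetical and chosen only for illustration.

```python
import numpy as np

def compress_dims(cache, num_coeffs):
    """Project each dimension's sequence onto Fourier bases (assumed scheme).

    cache: real array of shape (seq_len, n_dims), one column per head dimension.
    Returns complex coefficients of shape (num_coeffs, n_dims), whose size
    does not grow with seq_len.
    """
    spectrum = np.fft.rfft(cache, axis=0)   # (seq_len // 2 + 1, n_dims)
    return spectrum[:num_coeffs]            # keep the lowest frequencies only

def reconstruct_dims(coeffs, seq_len):
    """Approximate the original (seq_len, n_dims) slice from the kept coefficients."""
    full = np.zeros((seq_len // 2 + 1, coeffs.shape[1]), dtype=complex)
    full[: coeffs.shape[0]] = coeffs        # discarded frequencies stay zero
    return np.fft.irfft(full, n=seq_len, axis=0)

# Toy data: slowly varying signals plus a little noise, standing in for
# dimensions whose values change smoothly along the sequence axis.
rng = np.random.default_rng(0)
seq_len, n_dims, num_coeffs = 1024, 4, 16
t = np.arange(seq_len)[:, None]
freqs = np.array([1, 2, 3, 5])[None, :]
cache = np.sin(2 * np.pi * freqs * t / seq_len) + 0.01 * rng.standard_normal((seq_len, n_dims))

coeffs = compress_dims(cache, num_coeffs)
approx = reconstruct_dims(coeffs, seq_len)
rel_error = np.linalg.norm(approx - cache) / np.linalg.norm(cache)
```

In this toy setup each dimension is stored as 16 complex coefficients instead of 1024 real values, and `rel_error` measures how much is lost. How the paper selects which dimensions to compress and how it handles the remaining ones is not specified in the abstract and is not modeled here.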
