CL · Apr 16

Latent-Condensed Transformer for Efficient Long Context Modeling

Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan

arXiv:2604.12452 · 68.6 · h-index: 9

Predicted impact top 89% in CL · last 90 days

Originality: Incremental advance

AI Analysis

This work addresses the efficiency bottleneck of long-context modeling for LLMs, offering a parameter-free method that jointly optimizes computation and memory.

The paper proposes Latent-Condensed Attention (LCA), a method that jointly reduces both computational cost and KV cache for long-context LLMs by condensing context within MLA's latent space. LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
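To make the two condensation steps in the abstract concrete, here is a minimal NumPy sketch of condensing a latent cache group by group. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how pooling weights are computed or how anchors are chosen, so the fixed `group_size`, the softmax weighting by query-latent dot products, the highest-weight anchor rule, and the function name `condense_latent_cache` are all hypothetical.

```python
import numpy as np

def condense_latent_cache(latents, pos_keys, query, group_size):
    """Condense a latent KV cache one group at a time (illustrative only).

    latents:  (n, d_c) semantic latent vectors
    pos_keys: (n, d_r) positional keys
    query:    (d_c,) query assumed to be already mapped into the latent space
    """
    n, d_c = latents.shape
    pooled, anchors = [], []
    for start in range(0, n, group_size):
        block = latents[start:start + group_size]
        # Assumed form of query-aware pooling: softmax over scaled dot products.
        scores = block @ query / np.sqrt(d_c)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        pooled.append(weights @ block)
        # Assumed anchor rule: keep the positional key of the highest-weight token.
        anchor = start + int(np.argmax(weights))
        anchors.append(pos_keys[anchor])
    return np.stack(pooled), np.stack(anchors)

# Toy sizes chosen for illustration; a group size of 10 keeps 1 entry in 10.
rng = np.random.default_rng(0)
n, d_c, d_r, group_size = 1280, 64, 16, 10
latents = rng.standard_normal((n, d_c))
pos_keys = rng.standard_normal((n, d_r))
query = rng.standard_normal(d_c)

condensed_latents, condensed_pos = condense_latent_cache(
    latents, pos_keys, query, group_size
)
reduction = 1 - condensed_latents.shape[0] / n
print(condensed_latents.shape, condensed_pos.shape, reduction)
```

With a group size of 10 the condensed cache holds one tenth of the original entries, which matches the scale of the reported 90% reduction; the ratio used in the paper's experiments may be set differently. No parameters are learned in this sketch, consistent with the abstract's claim that the method adds none.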
