CL · AI · Feb 3

HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

arXiv:2602.03560v1 · 3 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses the high computational and memory costs of large language models, a concern for AI researchers and practitioners. It represents an incremental improvement over prior sparse attention methods.

The paper tackles the inefficiency of sparse attention methods by introducing HySparse, which interleaves full and sparse attention layers. Each full attention layer serves as an oracle for token selection and shares its KV cache with the sparse layers that follow it. The authors report substantial performance gains and nearly 10x less KV cache storage in an 80B MoE model.
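The "nearly 10x" figure can be sanity-checked with simple arithmetic. The paper's abstract reports 49 total layers, of which 5 use full attention, for the 80B MoE model. The sketch below assumes that only full attention layers store a KV cache, that every such layer's cache is the same size, and that no other per-layer state is kept; the function name `kv_cache_reduction` is hypothetical and not from the paper.

```python
def kv_cache_reduction(total_layers: int, full_layers: int) -> float:
    """Ratio of KV cache size when every layer caches vs. only full layers.

    Assumption: each caching layer stores an equally sized KV cache, and
    sparse layers store nothing of their own.
    """
    return total_layers / full_layers


# Layer counts reported in the abstract for the 80B MoE model.
ratio = kv_cache_reduction(total_layers=49, full_layers=5)
print(f"{ratio:.1f}x")  # 9.8x
```

Under these assumptions the ratio is 9.8, which is consistent with the reported "nearly 10x" reduction.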

This work introduces Hybrid Sparse Attention (HySparse), a new architecture that interleaves each full attention layer with several sparse attention layers. While conceptually simple, HySparse strategically derives each sparse layer's token selection and KV caches directly from the preceding full attention layer. This architecture resolves two fundamental limitations of prior sparse attention methods. First, conventional approaches typically rely on additional proxies to predict token importance, introducing extra complexity and potentially suboptimal performance. In contrast, HySparse uses the full attention layer as a precise oracle to identify important tokens. Second, existing sparse attention designs often reduce computation without saving KV cache. HySparse enables sparse attention layers to reuse the full attention KV cache, thereby reducing both computation and memory. We evaluate HySparse on both 7B dense and 80B MoE models. Across all settings, HySparse consistently outperforms both full attention and hybrid SWA baselines. Notably, in the 80B MoE model with 49 total layers, only 5 layers employ full attention, yet HySparse achieves substantial performance gains while reducing KV cache storage by nearly 10x.
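The two mechanisms described above, oracle token selection and KV cache reuse, can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: it assumes a single head, a single query per layer, and top-k selection on the full layer's attention weights. The function names (`full_attention`, `select_top_k`, `sparse_attention`) and the choice of k are hypothetical, and the paper's actual selection granularity may differ.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Dense attention for one query; also returns the attention weights."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def select_top_k(weights, k):
    """Indices of the k largest attention weights, in position order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(q, keys, values, selected):
    """Attention restricted to `selected`, reading the shared KV cache."""
    out, _ = full_attention(
        q, [keys[i] for i in selected], [values[i] for i in selected]
    )
    return out


# KV cache written once by the full attention layer (toy 2-d vectors).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

# Full layer: dense attention; its weights act as the oracle.
_, oracle_weights = full_attention([2.0, 1.0], keys, values)
selected = select_top_k(oracle_weights, k=2)

# Following sparse layer: its own query, but the same cache and selection.
sparse_out = sparse_attention([1.0, 0.5], keys, values, selected)
```

In this sketch the sparse layer needs no importance predictor of its own and allocates no keys or values of its own: it only indexes into the cache the full layer already holds.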
