LG · AI · PF · May 8

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

arXiv:2605.07719 · 77.31 citations
Predicted impact: top 19% in LG · last 90 days · Originality: Incremental advance
AI Analysis

This work improves inference efficiency for large language models handling long contexts, a critical bottleneck for deployment with limited GPU memory.

Fluxion addresses the inefficiency of long-context inference with CPU-resident KV caches by introducing a hybrid sparse attention mechanism that co-designs budget allocation, sparse configuration, and CPU-GPU execution overlap. It achieves a 1.5×–3.7× speedup over the strongest fixed sparse hybrid baseline, and its worst average quality degradation is -0.26 relative to FULL.

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse attention reduces attention cost in this setting, sparsity alone is insufficient for end-to-end efficiency. GPU-only designs remain constrained by PCIe bandwidth and metadata memory overhead, while CPU-GPU hybrid designs still suffer from substantial GPU idle time and bottlenecks in CPU-side top-k selection and sparse attention computation. Fluxion is built on three key insights: output-aware KV budgeting, head-specific and granularity-aware sparse configuration, and cross-device coordinated execution for sparse attention over CPU-resident KV caches. Guided by these insights, Fluxion combines a lightweight head-property predictor, a granularity-budget selector, and a priority-based scheduler to jointly optimize budget allocation, sparse configuration, and CPU-GPU execution overlap. This co-design enables hybrid sparse attention to achieve both accuracy and system efficiency in long-context inference. Across 2 models, 3 benchmarks, and 40 tasks, Fluxion preserves quality well: the worst average degradation is only -0.26 relative to FULL, while delivering a 1.5×–3.7× speedup over the strongest fixed sparse hybrid baseline, whose KV budget is only 0.05.
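
For readers unfamiliar with the setting, the sketch below illustrates generic block-sparse attention with top-k block selection under a fixed KV budget, for a single query and a single head. It is not Fluxion's algorithm. The function name `block_topk_sparse_attention`, the mean-key block score, and the reading of `budget` as the fraction of blocks kept are all assumptions made for illustration. The abstract's point is that Fluxion instead adapts the budget and block granularity per head.

```python
import math


def block_topk_sparse_attention(q, keys, values, block_size, budget):
    """Illustrative single-query block-sparse attention (hypothetical helper).

    q: list[float] of length d; keys: n lists of length d; values: n lists.
    budget: fraction of KV blocks to keep (an assumed interpretation).
    Returns (output vector, sorted indices of the selected blocks).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n, d = len(keys), len(q)
    # Partition token positions into contiguous blocks.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Score each block by q . mean(key); a cheap proxy chosen for this sketch.
    scores = []
    for blk in blocks:
        mean_key = [sum(keys[i][j] for i in blk) / len(blk) for j in range(d)]
        scores.append(dot(q, mean_key))

    # Top-k selection: keep the highest-scoring blocks within the budget.
    k = max(1, math.ceil(budget * len(blocks)))
    ranked = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)
    selected = sorted(ranked[:k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Softmax attention restricted to tokens in the selected blocks.
    scale = 1.0 / math.sqrt(d)
    logits = [dot(q, keys[i]) * scale for i in token_ids]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, token_ids)) / z
        for j in range(len(values[0]))
    ]
    return out, selected
```

With `budget=1.0` every block is selected and the result matches dense attention. Smaller budgets reduce the number of keys and values that must be read, which is the cost that matters when the KV cache lives in host memory.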
