OS · May 19

SpecSA: Bridging Speculative Decoding and Sparse Attention for Efficient LLM Inference

Zhibin Wang, Ziyu Zhong, Nuo Shen, Yuhang Zhou, Rong Gu, Sheng Zhong

arXiv:2605.19893 · 51.9

Predicted impact: top 31% in OS · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners deploying long-context LLMs, SpecSA provides a practical framework that combines two complementary acceleration techniques, overcoming their structural mismatch to deliver substantial speedups.

SpecSA bridges speculative decoding and dynamic sparse attention for efficient LLM inference, achieving up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification on NVIDIA H100 GPUs.

Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set. Directly combining them, however, exposes a structural mismatch: speculative verification relies on cross-query commonality, whereas dynamic sparse attention assigns query-specific sparse layouts. This mismatch limits KV-block reuse, amplifies NSA's branch-wise overheads, and makes verification strategy selection input- and regime-dependent. We present SpecSA, a sparse speculative-verification framework that turns dynamic sparse attention into a verification-oriented workload. SpecSA combines overlap-aware grouped-query execution, refresh/reuse-based NSA kernel fusion, and profile-guided prompt-adaptive orchestration to improve cross-query reuse, reduce selected-index and branch-fusion overheads, and select effective draft-verification strategies under user-specified precision classes. Experiments on NVIDIA H100 GPUs show that SpecSA achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding and up to 6.86x kernel speedups for sparse speculative verification.
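The structural mismatch described in the abstract can be made concrete with a small calculation. The sketch below is illustrative only and is not SpecSA's algorithm: it assumes each verifier query in a speculative batch has already selected a set of KV-block indices (as a dynamic sparse attention method would), and measures how much loading could in principle be shared across queries. The function name `block_reuse_ratio` and the example block indices are hypothetical.

```python
def block_reuse_ratio(selected_blocks):
    """Ratio of per-query KV-block loads to unique blocks touched.

    selected_blocks: list of sets, one per verifier query, each holding
    the KV-block indices that query's sparse layout selected.
    A ratio of 1.0 means no block is shared between queries; larger
    values mean more potential for cross-query reuse.
    """
    total_loads = sum(len(blocks) for blocks in selected_blocks)
    unique_blocks = len(set().union(*selected_blocks))
    return total_loads / unique_blocks if unique_blocks else 1.0


# Hypothetical layouts for four draft-token queries verified together.
overlapping = [{0, 1, 7, 8}, {0, 1, 7, 9}, {0, 1, 8, 9}, {0, 2, 8, 9}]
disjoint = [{0, 1}, {2, 3}]

print(block_reuse_ratio(overlapping))  # 16 loads over 6 unique blocks
print(block_reuse_ratio(disjoint))     # no sharing at all
```

When query-specific layouts diverge, the ratio falls toward 1.0 and the amortization that speculative verification depends on disappears, which is the reuse problem the abstract attributes to naively combining the two techniques.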
