DC LG · Aug 25, 2025

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, Binhang Yuan

arXiv:2508.18224v2 · 7 citations · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a performance bottleneck for researchers and practitioners using sparse attention in large language models, though it is incremental as it improves an existing method.

The paper tackles the inefficiency of the Native Sparse Attention (NSA) kernel when used with LLMs that have a small number of query heads per GQA group, proposing Flash Sparse Attention (FSA) as an alternative implementation that achieves up to 3.5x kernel-level latency reduction and up to 1.25x end-to-end training speedup.

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads in each GQA group; this inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables efficient NSA computation on modern GPUs across a wide range of popular LLMs with varied, smaller numbers of heads in each GQA group. Compared to the vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5x and on average 1.6x kernel-level latency reduction, (ii) up to 1.25x and on average 1.09x end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36x and on average 1.11x prefill-phase speedup in LLM generative inference. GitHub repo: https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
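A little arithmetic can illustrate why the number of query heads per GQA group matters. The sketch below is not taken from the NSA or FSA code. It assumes, purely for illustration, that a kernel batches the query heads of one GQA group into the rows of a matrix-multiply tile with a minimum row count (the value 16 is a hypothetical choice), so a small group leaves most of the tile as padding. The function names and head counts are likewise made up.

```python
def gqa_group_size(num_query_heads: int, num_kv_heads: int) -> int:
    """Number of query heads that share one key/value head in GQA."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    return num_query_heads // num_kv_heads


def padded_tile_utilization(group_size: int, min_tile_rows: int = 16) -> float:
    """Fraction of tile rows doing useful work when the query heads of one
    group are padded up to a multiple of min_tile_rows.

    min_tile_rows=16 is an assumed hardware granularity, not a figure
    from the paper.
    """
    # Round group_size up to the next multiple of min_tile_rows.
    padded_rows = -(-group_size // min_tile_rows) * min_tile_rows
    return group_size / padded_rows


if __name__ == "__main__":
    # Hypothetical (query heads, KV heads) configurations.
    for q_heads, kv_heads in [(64, 4), (32, 8), (32, 16)]:
        g = gqa_group_size(q_heads, kv_heads)
        util = padded_tile_utilization(g)
        print(f"q={q_heads} kv={kv_heads} group={g} utilization={util:.2%}")
```

Under these assumptions, a group of 16 query heads fills the tile completely, while a group of 4 uses only a quarter of it. This is the kind of mismatch the abstract describes between NSA's loop order and the small group sizes common in existing LLMs.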

