LG AI · Jun 10, 2025

SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, Lei Wang, Lingxiao Ma, Yutao Sun, Tianzhu Ye, Li Dong, Hayden Kwok-Hay So, Yu Hua

Tsinghua

arXiv:2506.08889v125.521 citations · h-index: 31 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the computational bottleneck of attention during long decoding in reasoning models, though it builds incrementally on prior sparse attention work (SeerAttention).

The paper tackles efficient long decoding for reasoning models by introducing SeerAttention-R, a sparse attention framework that maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark, while its sparse decoding kernel achieves up to 9x speedup over FlashAttention-3 on an H100 GPU at 90% sparsity.
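To put the 9x figure in context, here is a back-of-the-envelope sketch. It assumes that decoding attention cost scales with the number of KV tokens actually read, so the ideal speedup at sparsity s is 1 / (1 - s). The helper names and the 40,960-token context length are hypothetical and chosen only for illustration; the source does not state which context length the 90% sparsity figure corresponds to.

```python
def sparsity_from_budget(context_len: int, token_budget: int) -> float:
    """Fraction of KV tokens skipped when at most `token_budget` tokens are attended."""
    if context_len <= 0:
        raise ValueError("context_len must be positive")
    kept = min(token_budget, context_len)
    return 1.0 - kept / context_len


def ideal_speedup(sparsity: float) -> float:
    """Upper bound on speedup, assuming cost is proportional to KV tokens read."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return 1.0 / (1.0 - sparsity)


if __name__ == "__main__":
    # Hypothetical example: a 4K budget over a 40,960-token context.
    s = sparsity_from_budget(context_len=40_960, token_budget=4_096)
    print(f"sparsity = {s:.2f}, ideal speedup = {ideal_speedup(s):.1f}x")
```

Under this assumption the bound at 90% sparsity is about 10x, which is why a measured 9x can reasonably be described as near-theoretical.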

We introduce SeerAttention-R, a sparse attention framework specifically tailored for the long decoding of reasoning models. Extended from SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With a lightweight plug-in gating mechanism, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying the original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: https://github.com/microsoft/SeerAttention.
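The idea of block-sparse decoding under a token budget can be illustrated with a minimal NumPy sketch. This is not the SeerAttention-R implementation: the learned, self-distilled gate is replaced here by a simple stand-in (the query's dot product with each block's mean-pooled keys), and all function names are hypothetical. It only shows the general shape of the technique: score KV blocks for a single decoding query, keep the top blocks that fit the budget, and run softmax attention over those blocks alone.

```python
import numpy as np


def select_blocks(q, K, block_size, token_budget):
    """Choose KV blocks for one decoding query under a token budget.

    Stand-in gate (assumption): score = q . mean(keys in block).
    The paper's gate is learned; this heuristic is only for illustration.
    """
    n = K.shape[0]
    n_blocks = -(-n // block_size)  # ceiling division
    scores = np.array([
        q @ K[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    k = max(1, min(n_blocks, token_budget // block_size))
    return np.sort(np.argsort(scores)[-k:])  # indices of the k best blocks


def sparse_decode_attention(q, K, V, block_size, token_budget):
    """Single-query softmax attention restricted to the selected KV blocks."""
    n, d = K.shape
    blocks = select_blocks(q, K, block_size, token_budget)
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n)) for b in blocks
    ])
    logits = K[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return w @ V[idx], blocks


def dense_decode_attention(q, K, V):
    """Reference: single-query softmax attention over all KV tokens."""
    logits = K @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((1024, 16))
    V = rng.standard_normal((1024, 16))
    q = rng.standard_normal(16)
    out, blocks = sparse_decode_attention(q, K, V, block_size=64, token_budget=256)
    print("blocks kept:", len(blocks), "of", 1024 // 64)
```

Because selection happens per block rather than per token, larger block sizes such as the 64 and 128 mentioned above mean fewer gate scores to compute and more contiguous memory reads, at the cost of coarser selection.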
