CL · Jun 4, 2025

Rectified Sparse Attention

Yutao Sun, Tianzhu Ye, Li Dong, Yuqing Xia, Jian Chen, Yizhao Gao, Shijie Cao, Jianyong Wang, Furu Wei

Tsinghua

arXiv:2506.04108v2 · 9 citations · h-index: 11 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficiency challenges in long-context inference for LLMs, offering a practical incremental improvement over existing sparse decoding methods.

The paper tackles KV cache misalignment in sparse decoding methods for long-sequence generation in Large Language Models. It proposes Rectified Sparse Attention (ReSA), which achieves near-lossless generation quality with up to 2.42× end-to-end speedup at 256K sequence length.

Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42× end-to-end speedup under decoding at 256K sequence length, making it a practical solution for scalable long-context inference. Code is available at https://aka.ms/ReSA-LM.
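
The alternation the abstract describes (sparse decoding steps interleaved with a dense refresh at a fixed interval) can be illustrated with a minimal scheduling sketch. This is not the authors' implementation: the function name `resa_schedule`, the parameter `rectify_every`, and the event labels are hypothetical, and the sketch only models when each kind of pass would run, not the attention computation itself.

```python
from typing import List, Tuple

# Each event is (kind, start, end) over newly generated token positions.
Event = Tuple[str, int, int]


def resa_schedule(num_new_tokens: int, rectify_every: int) -> List[Event]:
    """Return the order of sparse decode steps and dense refreshes.

    Assumption (hypothetical, for illustration): every `rectify_every`
    sparsely decoded tokens, one dense forward pass recomputes the KV
    cache entries for exactly those tokens.
    """
    if rectify_every < 1:
        raise ValueError("rectify_every must be at least 1")

    events: List[Event] = []
    window_start = 0  # first token whose KV entry has not been refreshed
    for t in range(num_new_tokens):
        # One token generated with block-sparse attention.
        events.append(("sparse_decode", t, t + 1))
        if t + 1 - window_start == rectify_every:
            # Dense pass over the window replaces its approximate KV entries.
            events.append(("dense_rectify", window_start, t + 1))
            window_start = t + 1
    return events


def count_dense_passes(events: List[Event]) -> int:
    """Number of dense refreshes in a schedule."""
    return sum(1 for kind, _, _ in events if kind == "dense_rectify")


if __name__ == "__main__":
    for event in resa_schedule(num_new_tokens=5, rectify_every=2):
        print(event)
```

Under this assumed schedule, at most `rectify_every` tokens ever hold unrefreshed KV entries, which is the sense in which a fixed interval bounds error accumulation. A larger interval means fewer dense passes but a longer stretch of approximate cache entries.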
