LG AIFeb 26

S2O: Early Stopping for Sparse Attention via Online Permutation

Yu Zhang, Songwei Liu, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang

arXiv:2602.22575v11.4h-index: 6

Originality Highly original

AI Analysis

This work significantly improves the efficiency of long-context inference for large language models, which is a critical bottleneck for deploying such models in real-world applications.

S2O addresses the quadratic scaling of attention with sequence length by introducing an early stopping mechanism for sparse attention via online permutation. This method enables loading non-contiguous tokens based on importance, achieving a 3.82x reduction in single-operator MSE at matched sparsity and a 3.31x reduction in prefill compute density at matched MSE on Llama-3.1-8B with a 128K context, while preserving end-to-end accuracy and yielding 7.51x attention and 3.81x end-to-end speedups.

Attention scales quadratically with sequence length, fundamentally limiting long-context inference. Existing block-granularity sparsification can reduce latency, but coarse blocks impose an intrinsic sparsity ceiling, making further improvements difficult even with carefully engineered designs. We present S2O, which performs early stopping for sparse attention via online permutation. Inspired by virtual-to-physical address mapping in memory systems, S2O revisits and factorizes FlashAttention execution, enabling inference to load non-contiguous tokens rather than a contiguous span in the original order. Motivated by fine-grained structures in attention heatmaps, we transform explicit permutation into an online, index-guided, discrete loading policy; with extremely lightweight preprocessing and index-remapping overhead, it concentrates importance on a small set of high-priority blocks. Building on this importance-guided online permutation for loading, S2O further introduces an early-stopping rule: computation proceeds from high to low importance; once the current block score falls below a threshold, S2O terminates early and skips the remaining low-contribution blocks, thereby increasing effective sparsity and reducing computation under a controlled error budget. As a result, S2O substantially raises the practical sparsity ceiling. On Llama-3.1-8B under a 128K context, S2O reduces single-operator MSE by 3.82$\times$ at matched sparsity, and reduces prefill compute density by 3.31$\times$ at matched MSE; meanwhile, it preserves end-to-end accuracy and achieves 7.51$\times$ attention and 3.81$\times$ end-to-end speedups.

View on arXiv PDF

Similar