CL, LG, SY | Oct 31, 2025

SpecAttn: Speculating Sparse Attention

arXiv:2510.27641v1 | h-index: 2

Originality: Incremental advance

AI Analysis

This work addresses inference efficiency for LLM users by reducing computational cost. It is an incremental advance because it builds on existing speculative decoding techniques.

The paper tackles the computational bottleneck of self-attention in LLMs by introducing SpecAttn, a training-free method that integrates with speculative decoding to enable efficient sparse attention, achieving over 75% reduction in key-value cache accesses with only a 15.29% increase in perplexity on PG-19.

Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
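The top-p selection and cache-pruning steps named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract describes a GPU-optimized, sorting-free algorithm, whereas the version below uses a plain sort for readability. The function names `top_p_token_indices` and `prune_kv_cache`, and the list-based cache layout, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def top_p_token_indices(attn_weights: Sequence[float], p: float) -> List[int]:
    """Return the smallest set of positions whose normalized attention mass reaches p.

    Assumption: attn_weights holds one non-negative draft-model attention weight
    per cached token. A plain sort is used here; the paper describes a
    sorting-free GPU variant instead.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    total = sum(attn_weights)
    if total <= 0.0:
        raise ValueError("attention weights must have positive total mass")

    # Visit positions from highest to lowest weight.
    order = sorted(range(len(attn_weights)),
                   key=lambda i: attn_weights[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += attn_weights[i] / total
        if mass >= p:
            break
    # Restore positional order so the pruned cache keeps token order.
    return sorted(kept)


def prune_kv_cache(keys: Sequence, values: Sequence,
                   kept: Sequence[int]) -> Tuple[list, list]:
    """Keep only the key/value entries at the selected positions."""
    return [keys[i] for i in kept], [values[i] for i in kept]


if __name__ == "__main__":
    draft_attn = [0.125, 0.5, 0.125, 0.25]  # hypothetical draft attention row
    kept = top_p_token_indices(draft_attn, p=0.7)
    keys, values = prune_kv_cache(["k0", "k1", "k2", "k3"],
                                  ["v0", "v1", "v2", "v3"], kept)
    print(kept, keys, values)
```

In this toy setting the target model would attend only to the retained entries, so the number of key-value cache accesses shrinks in proportion to the tokens dropped. How close the pruned result stays to full attention depends on how well the draft model's attention pattern predicts the target model's.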
