CL · AI · Dec 11, 2025

SWAA: Sliding Window Attention Adaptation for Efficient Long-Context LLMs Without Pretraining

arXiv:2512.10411v4 · 3 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the inference cost that users of large language models face in long-context scenarios, offering a practical, incremental improvement over existing methods.

The paper tackles the high computational cost of long-context inference in LLMs by proposing SWAA, a toolkit that adapts models to sliding window attention without pretraining, achieving 30% to 100% speedups with acceptable quality loss.

The quadratic complexity of self-attention in Transformer-based Large Language Models (LLMs) renders long-context inference prohibitively expensive. While Sliding Window Attention (SWA), the simplest sparse attention pattern, offers a linear-complexity alternative, naively applying it to models pretrained with Full Attention (FA) causes catastrophic long-context performance collapse due to the training-inference mismatch. To address this, we propose Sliding Window Attention Adaptation (SWAA), a plug-and-play toolkit of recipes that adapt FA models to SWA without costly pretraining. SWAA systematically combines five strategies: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments demonstrate that while individual methods are insufficient, specific synergistic combinations can effectively recover original long-context capabilities. After further analyzing performance-efficiency trade-offs, we identify recommended SWAA configurations for diverse scenarios, which achieve 30% to 100% speedups for long-context LLM inference with acceptable quality loss. Our code is available at https://github.com/yuyijiong/sliding-window-attention-adaptation
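
To make the attention patterns in the abstract concrete, here is a minimal sketch in plain Python of a causal sliding-window mask that also keeps a few leading "sink" tokens visible, plus one possible way to interleave FA and SWA layers. The function names, the window size, the number of sink tokens, and the "every n-th layer stays FA" rule are illustrative assumptions; the abstract does not specify these choices, and this is not the authors' implementation (see their repository for that). Applying SWA only during prefilling, CoT, and fine-tuning are not modeled here.

```python
def swa_mask(seq_len, window, num_sink=0):
    """Build a seq_len x seq_len boolean mask (illustrative sketch).

    mask[q][k] is True when query position q may attend to key position k.
    A key is visible if it is not in the future (causal) and is either
    within the last `window` positions or one of the first `num_sink` tokens.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = (q - k) < window
            is_sink = k < num_sink
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


def layer_pattern(num_layers, fa_every):
    """Hypothetical interleaving rule: every `fa_every`-th layer keeps FA."""
    return ["FA" if i % fa_every == 0 else "SWA" for i in range(num_layers)]


if __name__ == "__main__":
    full = swa_mask(6, window=6)              # window covers everything: plain causal FA
    swa = swa_mask(6, window=2)               # plain SWA
    swa_sink = swa_mask(6, window=2, num_sink=1)
    print(attended_pairs(full), attended_pairs(swa), attended_pairs(swa_sink))
    print(layer_pattern(4, fa_every=2))
```

With a fixed window, the number of attended pairs grows linearly with sequence length rather than quadratically, which is the source of the efficiency gain the abstract describes; sink tokens add only a constant number of extra keys per query.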

