LG | Jun 11, 2025

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

arXiv:2506.09316v3 | 2 citations | h-index: 24 | Has Code | ICML
Originality: Incremental advance
AI Analysis

This work addresses efficiency bottlenecks in LLMs for long-context tasks, offering a domain-specific improvement that is incremental over existing linear attention methods.

The paper tackles the high compute and memory costs of large language models on lengthy inputs. It proposes dual-state linear attention (DSLA) and an online adaptive distillation framework (DSLA-Serve), and reports 2.3x faster inference than Llama2-7B and 3.0x faster inference than Zamba-7B while maintaining comparable performance.

Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy due to overemphasizing recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving the overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3x faster inference than Llama2-7B and 3.0x faster inference than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA's dual states capture both global and local dependencies, addressing the historical-token underrepresentation seen in prior linear-attention methods. Code is available at https://github.com/utnslab/DSLA-Serve.
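
The dual-state idea in the abstract can be illustrated with a small recurrence. The abstract gives no equations, so the sketch below is not the authors' formulation: it assumes a plain causal linear-attention update with two states that differ only in a fixed scalar decay (a slow one for history, a fast one for recency) and a fixed mixing weight. The function name `dual_state_linear_attention` and the parameters `decay_history`, `decay_recent`, and `mix` are hypothetical; the actual DSLA layer may use learned gates and normalization.

```python
import numpy as np

def dual_state_linear_attention(q, k, v, decay_history=0.999,
                                decay_recent=0.9, mix=0.5):
    """Causal linear attention with two recurrent states (illustrative only).

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    decay_history, decay_recent and mix are assumed hyperparameters,
    not values taken from the paper.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    s_hist = np.zeros((d_k, d_v))  # slowly decaying state: long-range context
    s_rec = np.zeros((d_k, d_v))   # quickly decaying state: recent tokens
    out = np.zeros((T, d_v))
    for t in range(T):
        kv = np.outer(k[t], v[t])              # rank-1 update for token t
        s_hist = decay_history * s_hist + kv
        s_rec = decay_recent * s_rec + kv
        # Read both states with the current query and blend the results.
        out[t] = q[t] @ (mix * s_hist + (1.0 - mix) * s_rec)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 4))
    k = rng.normal(size=(8, 4))
    v = rng.normal(size=(8, 3))
    print(dual_state_linear_attention(q, k, v).shape)
```

Both states have shape (d_k, d_v) regardless of sequence length, so per-token cost and memory stay constant in this sketch, whereas a softmax-attention key-value cache grows with the number of tokens. With a single fast-decaying state, old tokens would fade quickly; the slow state is what keeps their contribution alive.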

Code Implementations: 1 repo
