LG, Jun 11, 2025

On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention

Yeonju Ro, Zhenyu Zhang, Souvik Kundu, Zhangyang Wang, Aditya Akella

arXiv:2506.09316v39.42 citations, h-index: 62, Has Code, ICML

Originality: Incremental advance

AI Analysis

This work addresses efficiency bottlenecks in LLMs for long-context tasks, offering a domain-specific improvement that is incremental over existing linear attention methods.

The paper tackles the high compute and memory costs of large language models on lengthy inputs by proposing dual-state linear attention (DSLA) and an online adaptive distillation framework (DSLA-Serve), achieving 2.3x faster inference than Llama2-7B and 3.0x faster inference than Zamba-7B while maintaining comparable performance.

Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. While sub-quadratic methods (e.g., linear attention) can reduce these costs, they often degrade accuracy because they overemphasize recent tokens. In this work, we first propose dual-state linear attention (DSLA), a novel design that maintains two specialized hidden states (one for preserving historical context and one for tracking recency), thereby mitigating the short-range bias typical of linear-attention architectures. To further balance efficiency and accuracy under dynamic workload conditions, we introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering. DSLA-Serve uses a chained fine-tuning strategy to ensure that each newly converted DSLA layer remains consistent with previously replaced layers, preserving overall quality. Extensive evaluations on commonsense reasoning, long-context QA, and text summarization demonstrate that DSLA-Serve yields 2.3x faster inference than Llama2-7B and 3.0x faster inference than the hybrid Zamba-7B, while retaining comparable performance across downstream tasks. Our ablation studies show that DSLA's dual states capture both global and local dependencies, addressing the underrepresentation of historical tokens seen in prior linear attention methods. Code is available at https://github.com/utnslab/DSLA-Serve.
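
To make the dual-state idea concrete, the following is a minimal sketch of a causal linear-attention recurrence that keeps two hidden states. The abstract does not give DSLA's update rules, so everything specific here is an assumption for illustration: the function name `dual_state_linear_attention`, the use of fixed scalar decays (`slow_decay`, `fast_decay`) to separate the history state from the recency state, and the fixed `mix` weight that combines them. It should not be read as the authors' formulation.

```python
import numpy as np


def dual_state_linear_attention(q, k, v, slow_decay=0.999, fast_decay=0.9, mix=0.5):
    """Causal linear attention with two recurrent states (illustrative only).

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    slow_decay, fast_decay, and mix are hypothetical hyperparameters,
    not values or mechanisms taken from the paper.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    history = np.zeros((d_k, d_v))  # decays slowly, so older tokens persist
    recency = np.zeros((d_k, d_v))  # decays quickly, so recent tokens dominate
    out = np.empty((T, d_v))
    for t in range(T):
        kv = np.outer(k[t], v[t])  # rank-1 contribution of token t
        history = slow_decay * history + kv
        recency = fast_decay * recency + kv
        # Assumed combination rule: a fixed convex mix of the two states.
        state = mix * history + (1.0 - mix) * recency
        out[t] = q[t] @ state
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 4))
    k = rng.normal(size=(8, 4))
    v = rng.normal(size=(8, 3))
    print(dual_state_linear_attention(q, k, v).shape)
```

Each step touches only the two fixed-size states, so per-token cost and memory do not grow with sequence length, which is the property that makes linear attention attractive for long inputs. With both decays set to 1.0 the sketch reduces to plain unnormalized causal linear attention.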

