CL · Feb 26, 2025

TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng

Peking U

arXiv:2502.18890v2 · 14.74 citations · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This work offers a scalable way to accelerate ultra-long sequence generation, which matters for applications that produce extensive text output. The advance is incremental, since it builds on existing speculative decoding methods.

The paper addresses the high time cost of generating ultra-long sequences (up to 100K tokens) with large language models. It introduces TOKENSWIFT, a framework that targets bottlenecks such as frequent model reloading and dynamic KV management, and reports a speedup of more than 3x across model scales and architectures while preserving the target model's output quality.

Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
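For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of generic greedy speculative decoding (draft, then verify), not TOKENSWIFT's own algorithm. It illustrates why such schemes can be lossless: every emitted token is the one the target model would have chosen itself. The toy models `toy_target` and `toy_draft`, the function names, and the draft length `k` are all hypothetical and exist only for illustration.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large target model (deterministic)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical cheap draft model: agrees with the target most of the time."""
    guess = toy_target(prefix)
    # Deliberately wrong on every fifth position to exercise the rejection path.
    return (guess + 1) % 50 if len(prefix) % 5 == 0 else guess


def greedy_decode(target: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive decoding: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> List[int]:
    """Draft k tokens cheaply, then keep only the prefix the target agrees with."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) The target verifies the proposal position by position. In a real
        #    system this is a single batched forward pass; here it is a loop.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: emit the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # All k drafts accepted: the target contributes one extra token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[:goal]


prompt = [1, 2, 3]
baseline = greedy_decode(toy_target, prompt, 200)
fast = speculative_decode(toy_target, toy_draft, prompt, 200, k=4)
print(fast == baseline)
```

In this sketch the speedup would come from step 2 being a single target pass over several tokens; the loop above only demonstrates the acceptance rule, not the timing.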
