CL, AI
Oct 17, 2025

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou

arXiv:2510.15545v2
h-index: 1

AI Analysis

TokenTiming enables a universal approach to draft model selection in speculative decoding, making speculative decoding a more versatile tool for LLM acceleration. The contribution is incremental, since it builds on existing speculative decoding methods.

The paper tackles a limitation of speculative decoding: the draft and target models must share the same vocabulary, which restricts the choice of draft models and often requires training a new one from scratch. It proposes TokenTiming, a dynamic alignment method that allows mismatched vocabularies, and reports a 1.57x speedup in experiments.

Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, which limits the pool of available draft models and often necessitates training a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose TokenTiming, an algorithm for universal speculative decoding. It re-encodes the draft token sequence to obtain a new target token sequence, and then uses DTW to build a mapping that transfers the probability distributions for speculative sampling. As a result, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
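To make the alignment step concrete, here is a minimal sketch of classic DTW applied to two tokenizations of the same text. It is not the paper's implementation: the abstract does not specify the cost function, so the character-span overlap cost used here is an assumption, and the names `char_spans`, `overlap_cost`, and `dtw_align` are hypothetical. The transfer of probability distributions along the resulting mapping is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]


def char_spans(tokens: Sequence[str]) -> List[Span]:
    """Return the (start, end) character offsets of each token in the joined text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def overlap_cost(a: Span, b: Span) -> float:
    """Assumed cost: 0 if the two character spans overlap, 1 otherwise."""
    return 0.0 if a[0] < b[1] and b[0] < a[1] else 1.0


def dtw_align(
    a: Sequence[Span], b: Sequence[Span], cost: Callable[[Span, Span], float]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW: return the total cost and the warping path as index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] is the minimal cost of aligning a[:i] with b[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = cost(a[i - 1], b[j - 1])
            D[i][j] = c + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (D[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (D[i - 1][j], i - 1, j),          # up
            (D[i][j - 1], i, j - 1),          # left
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return D[n][m], path


# Two hypothetical tokenizations of the same string "Hello world".
draft_tokens = ["Hel", "lo", " world"]
target_tokens = ["Hello", " wor", "ld"]

total, path = dtw_align(char_spans(draft_tokens), char_spans(target_tokens), overlap_cost)
for di, ti in path:
    print(repr(draft_tokens[di]), "<->", repr(target_tokens[ti]))
```

Each pair in `path` links a draft token index to a target token index. In this example the first two draft tokens both map to the single target token `"Hello"`, and the last draft token maps to two target tokens, which is the kind of many-to-many mapping that mismatched vocabularies require.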
