CL · AI · LG · Jan 31, 2025

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

arXiv:2502.05202v3 · 18 citations · h-index: 13 · ICML
Originality: Incremental advance
AI Analysis

This work addresses a practical bottleneck in LLM inference for AI developers: it lets speculative decoding use off-the-shelf drafter models, which broadens where the technique can be applied. It is rated incremental because it builds on existing SD frameworks.

The paper tackles the requirement in existing speculative decoding methods that the drafter and target models share a vocabulary. It presents three new lossless algorithms that allow any off-the-shelf model to serve as a drafter without retraining, achieving speedups of up to 2.8x over standard autoregressive decoding on tasks such as summarization and programming.

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. However, existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters, often necessitating the training of a drafter from scratch. We present three new SD methods that remove this shared-vocabulary constraint. All three methods preserve the target distribution (i.e., they are lossless) and work with off-the-shelf models without requiring additional training or modifications. Empirically, on summarization, programming, and long-context tasks, our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding. By enabling any off-the-shelf model to serve as a drafter and requiring no retraining, this work substantially broadens the applicability of the SD framework in practice.
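
To make "lossless" concrete, here is a minimal sketch of the verification step in standard speculative sampling, the shared-vocabulary baseline that the abstract builds on. It is not one of the paper's three heterogeneous-vocabulary algorithms, which the abstract does not describe in detail. The function names, the toy distributions, and the use of plain Python lists as probability vectors are assumptions made for illustration only.

```python
import random


def speculative_step(p_target, q_draft, draft_token, rng):
    """Verify one drafted token (standard speculative sampling, shared vocabulary).

    p_target, q_draft: probability lists over the SAME vocabulary (assumption).
    draft_token: index that the drafter sampled from q_draft.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # q > 0 because the token was sampled from q_draft
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized
    # implicitly by passing it as sampling weights.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    token = rng.choices(range(len(residual)), weights=residual, k=1)[0]
    return token, False


def sample_token(p_target, q_draft, rng):
    """Draft one token from q_draft, then verify it against p_target."""
    draft = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    token, _ = speculative_step(p_target, q_draft, draft, rng)
    return token


def empirical_distribution(p_target, q_draft, n, seed=0):
    """Estimate the output distribution of draft-then-verify over n trials."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(n):
        counts[sample_token(p_target, q_draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    # Toy distributions (hypothetical): the drafter disagrees with the target.
    p = [0.5, 0.3, 0.2]
    q = [0.2, 0.2, 0.6]
    print(empirical_distribution(p, q, n=50_000))
```

Although tokens are proposed from `q`, the accept-or-resample rule makes the emitted token follow `p` exactly, which is the sense in which such methods preserve the target distribution. The rule compares `p` and `q` token by token, so it presupposes a shared vocabulary; that is the constraint the paper sets out to remove.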
