CL | May 7, 2024

Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

arXiv:2405.04304v5 | 18 citations | h-index: 13 | ENLSP
Originality: Incremental advance
AI Analysis

This work targets inference latency in large language models and is an incremental improvement over existing speculative decoding techniques.

The paper tackles the suboptimal use of a static speculation lookahead (SL) in speculative decoding for large language models. It introduces DISCO, a method that selects the SL dynamically, and reports an average speedup of 10% over the best static-SL baseline while generating identical text.

Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
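To make the role of the SL concrete, here is a minimal toy sketch of greedy speculative decoding in Python. It is not the paper's implementation. The two "models" are hypothetical deterministic functions over integer tokens, and the dynamic rule (stop drafting when the draft's confidence drops below a threshold) is an illustrative stand-in, since the abstract does not describe how DISCO chooses the SL. The names `target_model`, `draft_model`, `speculative_decode`, and `should_continue` are all assumptions made for this example.

```python
from typing import Callable, List, Tuple

Token = int


def target_model(seq: List[Token]) -> Tuple[Token, float]:
    """Hypothetical 'large' model: a deterministic toy rule, not a real LLM."""
    return (seq[-1] * 3 + len(seq)) % 7, 1.0


def draft_model(seq: List[Token]) -> Tuple[Token, float]:
    """Hypothetical 'small' model: agrees with the target except at some
    positions, where it is wrong and reports low confidence."""
    tok, _ = target_model(seq)
    if len(seq) % 4 == 0:
        return (tok + 1) % 7, 0.3
    return tok, 0.9


def greedy_decode(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference output: plain greedy decoding with the target model only."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_model(seq)[0])
    return seq


def speculative_decode(
    prompt: List[Token],
    n_new: int,
    should_continue: Callable[[int, float], bool],
    max_sl: int = 8,
) -> Tuple[List[Token], int, int]:
    """Greedy speculative decoding. `should_continue(n_drafted, confidence)`
    decides after each draft token whether to keep drafting, so it sets the SL.
    Returns (tokens, verification_rounds, draft_calls)."""
    seq = list(prompt)
    rounds = 0
    draft_calls = 0
    while len(seq) - len(prompt) < n_new:
        # Draft phase: generate up to max_sl candidate tokens.
        draft: List[Token] = []
        while len(draft) < max_sl:
            tok, conf = draft_model(seq + draft)
            draft_calls += 1
            draft.append(tok)
            if not should_continue(len(draft), conf):
                break
        # Verification phase: counted as one round. A real system would
        # score all draft positions in a single batched target forward pass.
        rounds += 1
        accepted: List[Token] = []
        for tok in draft:
            expected, _ = target_model(seq + accepted)
            accepted.append(expected)  # the target's token is always kept
            if expected != tok:
                break  # first mismatch: discard the remaining draft tokens
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_model(seq + accepted)[0])
        seq.extend(accepted)
    return seq[: len(prompt) + n_new], rounds, draft_calls


if __name__ == "__main__":
    prompt = [1, 2]
    static = speculative_decode(prompt, 20, lambda n, conf: n < 4)
    dynamic = speculative_decode(prompt, 20, lambda n, conf: conf >= 0.5)
    print("static  SL=4:", static[1], "rounds,", static[2], "draft calls")
    print("dynamic rule:", dynamic[1], "rounds,", dynamic[2], "draft calls")
```

Because every emitted token is the target model's own greedy choice, both the static and the dynamic variant produce exactly the text that plain greedy decoding would. Only the number of verification rounds and draft calls changes, which is where the choice of SL affects latency.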

