CL May 7, 2024

Dynamic Speculation Lookahead Accelerates Speculative Decoding of Large Language Models

Jonathan Mamou, Oren Pereg, Daniel Korat, Moshe Berchansky, Nadav Timor, Moshe Wasserblat, Roy Schwartz

arXiv:2405.04304v511.518 citations h-index: 13 ENLSP

Originality: Incremental advance

AI Analysis

This work addresses inference latency reduction for users of large language models, representing an incremental improvement over existing speculative decoding techniques.

The paper addresses the suboptimal practice of using a static speculation lookahead in speculative decoding for large language models. It introduces DISCO, a method for dynamically selecting the lookahead, which achieves an average speedup of 10% over the best static baseline while generating identical text.

Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL), the number of tokens generated by the draft model at each iteration. In this work, we show that the common practice of using the same SL for all iterations (static SL) is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
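
To make the role of the SL concrete, here is a minimal toy sketch of greedy speculative decoding in which the SL is decided by a pluggable stopping rule. It is not the DISCO method: the abstract does not say how DISCO chooses the SL, so the confidence-threshold rule below is only an assumed stand-in for "some dynamic policy". The toy models `target_next` and `draft_next`, the `keep_drafting` callback, and the call counters are all hypothetical names introduced for illustration.

```python
from typing import Callable, List, Tuple

Token = int
VOCAB = 11  # toy vocabulary size (assumption)


def target_next(seq: List[Token]) -> Token:
    """Toy stand-in for the large model's greedy next token."""
    return (3 * seq[-1] + len(seq)) % VOCAB


def draft_next(seq: List[Token]) -> Tuple[Token, float]:
    """Toy stand-in for the draft model: returns (token, confidence).

    By construction it disagrees with the target at every 4th position
    and reports a low confidence there (a convenient toy assumption).
    """
    correct = target_next(seq)
    if len(seq) % 4 == 0:
        return (correct + 1) % VOCAB, 0.3
    return correct, 0.9


def speculative_decode(
    prompt: List[Token],
    n_new: int,
    keep_drafting: Callable[[int, float], bool],
) -> Tuple[List[Token], int, int]:
    """Greedy speculative decoding with a pluggable SL policy.

    keep_drafting(n_drafted, last_confidence) decides whether the draft
    model should produce another token in the current iteration.
    Returns (tokens, target_calls, draft_calls).
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_calls = draft_calls = 0
    while len(seq) < goal:
        # Draft phase: the number of tokens drafted here is this iteration's SL.
        draft: List[Token] = []
        while True:
            tok, conf = draft_next(seq + draft)
            draft_calls += 1
            draft.append(tok)
            if not keep_drafting(len(draft), conf):
                break
        # Verify phase: counted as one target pass over all drafted positions.
        target_calls += 1
        for tok in draft:
            expected = target_next(seq)
            seq.append(expected)  # accepted token, or the target's correction
            if expected != tok:
                break  # discard the rest of the draft
        else:
            seq.append(target_next(seq))  # bonus token when all drafts match
    return seq[:goal], target_calls, draft_calls


# Static SL: always draft exactly 3 tokens per iteration.
static_out, static_t, static_d = speculative_decode([1], 20, lambda n, c: n < 3)
# Dynamic SL (assumed heuristic): stop once the draft looks unsure, cap at 8.
dyn_out, dyn_t, dyn_d = speculative_decode([1], 20, lambda n, c: c >= 0.5 and n < 8)
print("static :", static_t, "target calls,", static_d, "draft calls")
print("dynamic:", dyn_t, "target calls,", dyn_d, "draft calls")
```

Because the verification step always falls back to the target model's own token, both policies return the same text as plain greedy decoding with the target model; only the number of draft and target calls changes with the SL policy.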
