LG · Nov 6, 2024

The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Lawrence Stewart, Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto

Amazon

arXiv:2411.03786v1 15.715 citations h-index: 19 ENLSP

Originality: Incremental advance

AI Analysis

This work addresses the bottleneck of inference speed for users of large language models, offering a plug-and-play solution that is incremental but practical.

The paper tackles slow autoregressive inference in language models by proposing learning-free, negligible-cost draft strategies based on N-grams drawn from the model weights and the context. These strategies achieve significant inference speedups, comparable to those of more complex methods, without expensive preprocessing or modification of the base model.

Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-$k$ predictions for small $k$. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless "plug-and-play" integration into pipelines.
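The idea of drafting with top-$k$ context N-grams can be illustrated with a minimal Python sketch. It is not the authors' implementation: it covers only a context bigram drafter (not N-grams from model weights), speculates a single token ahead, and uses a hypothetical `toy_model` function in place of a real language model. The names `context_bigram_topk`, `speculative_step`, and `generate` are illustrative assumptions. In a real system the $k$ candidates would be verified in one batched forward pass; here that pass is simulated with ordinary function calls.

```python
from collections import Counter, defaultdict


def context_bigram_topk(context, k):
    """Return up to k candidates for the next token, ranked by how often
    each one followed the last context token earlier in the context."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(context, context[1:]):
        followers[prev][nxt] += 1
    return [tok for tok, _ in followers[context[-1]].most_common(k)]


def speculative_step(context, base_next_token, k):
    """One simulated verification call. Returns the tokens it produces.

    Assumption: a real batched pass over the k rows (context + candidate)
    yields both the base model's next token and, for each row, the token
    after the candidate. We emulate that with two calls to base_next_token.
    """
    candidates = context_bigram_topk(context, k)
    target = base_next_token(context)  # what greedy decoding would emit
    if target in candidates:
        # Draft hit: the matching row also provides a bonus token.
        return [target, base_next_token(context + [target])]
    return [target]  # draft miss: still one correct token per call


def generate(prompt, base_next_token, max_new_tokens, k):
    """Greedy generation with top-k bigram drafts; counts simulated calls."""
    tokens = list(prompt)
    calls = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        tokens.extend(speculative_step(tokens, base_next_token, k))
        calls += 1
    return tokens[: len(prompt) + max_new_tokens], calls


def toy_model(context):
    """Hypothetical stand-in for a base model: cycles 0, 1, 2, 3, 0, ..."""
    return (context[-1] + 1) % 4


if __name__ == "__main__":
    out, calls = generate([0, 1, 2, 3, 0], toy_model, max_new_tokens=6, k=2)
    print(out, calls)
```

Because every emitted token comes from the base model, the output matches plain greedy decoding; the drafts only change how many tokens each verification call yields. In this sketch, raising `k` increases the chance that the base model's token is among the candidates, at the cost of a larger verification batch.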
