LG | Nov 6, 2024

The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Amazon
arXiv:2411.03786v1 | 13 citations | h-index: 19 | ENLSP
Originality: Incremental advance
AI Analysis

This work targets inference speed, a bottleneck for users of large language models, with a plug-and-play solution that is incremental but practical.

The paper tackles slow autoregressive inference in language models by proposing learning-free, negligible-cost draft strategies based on N-grams obtained from the model weights and the context. It reports significant inference speedups, comparable to those of more complex methods, without expensive preprocessing or modification of the base model.

Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-$k$ predictions for small $k$. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless 'plug-and-play' integration into pipelines.
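
To make the abstract's idea concrete, here is a minimal sketch of a context N-gram drafter followed by greedy verification of several drafts. It is an illustration under stated assumptions, not the paper's implementation: the function names `context_ngram_topk` and `longest_accepted` are hypothetical, the base model is stood in for by an arbitrary callable `base_next`, and the verification loop calls it sequentially where a real system would score all drafts in one batched forward pass.

```python
from collections import Counter


def context_ngram_topk(tokens, n=2, k=3):
    """Return up to k candidate next tokens, ranked by how often each one
    followed the current (n-1)-token suffix earlier in `tokens`.

    Hypothetical helper: a learning-free draft built only from the context.
    """
    if n < 2 or len(tokens) < n:
        return []
    suffix = tuple(tokens[-(n - 1):])
    counts = Counter()
    # Slide over every complete n-gram in the context and count the
    # continuations of windows whose first n-1 tokens equal the suffix.
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == suffix:
            counts[tokens[i + n - 1]] += 1
    return [tok for tok, _ in counts.most_common(k)]


def longest_accepted(prefix, drafts, base_next):
    """Greedy verification: for each draft, keep the leading tokens that
    match what the base model would have produced, and return the longest
    such run across all drafts.

    `base_next` is an assumed callable mapping a token list to the base
    model's greedy next token. It is called one position at a time here
    for clarity only.
    """
    best = []
    for draft in drafts:
        accepted = []
        for tok in draft:
            if base_next(prefix + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best
```

Keeping the top-$k$ candidates rather than only the top one reflects the abstract's observation that the base model's token is often among the top-$k$ draft predictions even when it is not the first. Because every accepted token equals the base model's own greedy choice, the output in this greedy setting is unchanged, and only the number of sequential model calls is reduced.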

