AI | CL | LG | Oct 11, 2023

Online Speculative Decoding

arXiv:2310.07177v4 | 108 citations | h-index: 42 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the inference speed bottleneck faced by users of large language models by improving speculative decoding. The advance is incremental, as it builds on existing techniques.

The paper tackles the problem of low predictive accuracy in speculative decoding for large language models by introducing online updates to draft models based on user query data, resulting in a token acceptance rate increase of 0.1 to 0.65 and latency reduction of 1.42x to 2.17x.

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continuously update the (multiple) draft model(s) on observed user query data. Adapting to query distribution mitigates the shifts between the training distribution of the draft model and the query distribution, enabling the draft model to more accurately predict the target model's outputs. We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data. The results show a substantial increase in the token acceptance rate by 0.1 to 0.65, bringing 1.42x to 2.17x latency reduction. Our code is available at https://github.com/LiuXiaoxuanPKU/OSD.
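The link between draft accuracy, token acceptance rate, and speedup can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a single categorical next-token distribution in place of real models, the standard speculative sampling acceptance rule (accept a drafted token with probability min(1, p/q)), and a forward-KL distillation step on the draft logits. All names (`acceptance_rate`, `expected_tokens_per_step`, `distill_step`) and the numbers in the usage section are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_rate(p, q):
    """Probability that a token drawn from draft q is accepted under the
    standard speculative sampling rule; equals sum_i min(p_i, q_i)."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model pass when the draft proposes
    gamma tokens, assuming each is accepted independently with rate alpha."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def distill_step(draft_logits, p, lr):
    """One gradient step on KL(p || q) w.r.t. the draft logits.
    For q = softmax(logits), the gradient is simply q - p."""
    q = softmax(draft_logits)
    return [z - lr * (qi - pi) for z, qi, pi in zip(draft_logits, q, p)]


# Toy usage: a fixed target distribution and a draft that starts out uniform.
target = [0.70, 0.20, 0.05, 0.05]   # assumed target next-token distribution
draft_logits = [0.0, 0.0, 0.0, 0.0]  # uniform draft before any update

before = acceptance_rate(target, softmax(draft_logits))
for _ in range(200):                 # stand-in for updates on observed queries
    draft_logits = distill_step(draft_logits, target, lr=0.5)
after = acceptance_rate(target, softmax(draft_logits))

print(f"acceptance before: {before:.3f}, after: {after:.3f}")
print(f"tokens per step (gamma=3): "
      f"{expected_tokens_per_step(before, 3):.2f} -> "
      f"{expected_tokens_per_step(after, 3):.2f}")
```

In this sketch, pulling the draft distribution toward the target raises the acceptance rate, which in turn raises the expected number of tokens produced per target-model pass. A real system would distill on target-model outputs for actual user queries rather than on a fixed distribution, and measured latency also depends on the cost of running the draft model.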

Code Implementations: 1 repo
