CL · Apr 29, 2024

Accelerating Production LLMs with Combined Token/Embedding Speculators

arXiv:2404.19124v2 · 8 citations · h-index: 36
Originality: Incremental advance
AI Analysis

This work targets inference latency, a key bottleneck for production LLMs. The advance is incremental, but the reported speedup matters for real-world deployment.

The paper tackles slow inference for large language models in production. It introduces speculative decoding draft models that condition on both context vectors and sampled tokens to predict high-quality n-grams, yielding a 2-3x speedup in wall-clock inference for optimized base models.

This technical report describes the design and training of novel speculative decoding draft models for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both context vectors and sampled tokens, we can train our speculators to efficiently predict high-quality n-grams, which the base model then accepts or rejects. This allows us to effectively predict multiple tokens per inference forward pass, accelerating wall-clock inference speeds of highly optimized base model implementations by a factor of 2-3. We explore these initial results and describe next steps for further improvements.
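
The accept-or-reject step described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation. It assumes greedy decoding with exact-match verification. The names `base_next`, `draft_ngram`, `verify_draft`, and `speculative_generate` are hypothetical. The toy draft model conditions only on tokens, not on context vectors. A real system would score all draft positions in a single batched forward pass of the base model; here the base model is called per position and each verification step is counted as one pass.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
BaseFn = Callable[[Sequence[Token]], Token]              # greedy next token from the base model
DraftFn = Callable[[Sequence[Token], int], List[Token]]  # n-gram proposed by the speculator


def verify_draft(base_next: BaseFn, context: Sequence[Token], draft: Sequence[Token]) -> List[Token]:
    """Accept the longest draft prefix the base model agrees with, plus one base token."""
    ctx = list(context)
    accepted: List[Token] = []
    for tok in draft:
        expected = base_next(ctx)
        if tok != expected:
            # First mismatch: keep the base model's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    # Whole draft accepted: the same pass also yields one extra base token.
    accepted.append(base_next(ctx))
    return accepted


def speculative_generate(
    base_next: BaseFn, draft_ngram: DraftFn, prompt: Sequence[Token], max_new_tokens: int, n: int = 3
) -> Tuple[List[Token], int]:
    """Generate tokens, returning the output and the number of verification passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_ngram(out, n)
        out.extend(verify_draft(base_next, out, draft))
        passes += 1
    return out[: len(prompt) + max_new_tokens], passes


# Toy models (assumptions for illustration only).
def toy_base(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[Token], n: int) -> List[Token]:
    cur, guess = ctx[-1], []
    for _ in range(n):
        cur = (cur + 1) % 10
        if cur == 5:
            cur = 0  # deliberate mistake so that some drafts get rejected
        guess.append(cur)
    return guess


tokens, passes = speculative_generate(toy_base, toy_draft, prompt=[0], max_new_tokens=12)
```

Because rejected drafts are replaced by the base model's own token, the output matches plain greedy decoding of the base model. Only the number of base-model passes changes, and it depends on how often the draft is accepted.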

Code Implementations: 1 repo