CL · LG · Jun 24, 2024

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

arXiv:2406.16858v2 · 301 citations
Originality: Incremental advance
AI Analysis

This work addresses the inference bottleneck for users of large language models, offering a significant but incremental improvement over existing speculative sampling techniques.

The paper tackles slow and expensive inference in large language models by introducing EAGLE-2, a speculative sampling method that uses context-aware dynamic draft trees. It achieves speedup ratios of 3.05x-4.26x, which is 20%-40% faster than its predecessor EAGLE-1, while leaving the distribution of the generated text unchanged (lossless generation).

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
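The abstract does not spell out how the dynamic draft tree is built, so the following is only a minimal sketch of the idea it describes: if draft confidences approximate acceptance rates, the product of confidences along a path estimates how likely that whole path is to be accepted, and the tree can be grown where that estimate is high. The expand-then-rerank procedure, the function `build_dynamic_draft_tree`, the parameters `depth`, `expand_k`, and `keep_m`, and the toy draft model are all illustrative assumptions, not the authors' implementation.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

# A path is the tuple of token ids drafted after the current context.
Path = Tuple[int, ...]
# Hypothetical draft-model interface: path -> {next_token: confidence}.
DraftFn = Callable[[Path], Dict[int, float]]


def build_dynamic_draft_tree(
    draft_fn: DraftFn, depth: int, expand_k: int, keep_m: int
) -> List[Tuple[Path, float]]:
    """Grow a draft tree guided by cumulative draft confidence.

    A node's value is the product of confidences along its path, used here
    as a stand-in for the probability that the whole path is accepted.
    Returns the keep_m highest-valued paths with their values.
    """
    scores: Dict[Path, float] = {}                    # path -> log-value
    frontier: List[Tuple[Path, float]] = [((), 0.0)]  # root has value 1
    for _ in range(depth):
        candidates: List[Tuple[Path, float]] = []
        for path, logv in frontier:
            for tok, p in draft_fn(path).items():
                if p > 0.0:
                    candidates.append((path + (tok,), logv + math.log(p)))
        if not candidates:
            break
        for path, logv in candidates:
            scores[path] = logv
        # Expand only the expand_k most promising nodes at the next level.
        frontier = heapq.nlargest(expand_k, candidates, key=lambda c: c[1])
    # Rerank all drafted nodes. A child never outscores its parent, and ties
    # prefer shorter paths, so the kept nodes form a connected tree.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], len(kv[0]), kv[0]))
    return [(path, math.exp(logv)) for path, logv in ranked[:keep_m]]


def toy_draft(path: Path) -> Dict[int, float]:
    """Toy context-dependent draft model: confident right after token 1."""
    if path and path[-1] == 1:
        return {1: 0.9, 2: 0.1}
    return {1: 0.6, 2: 0.4}


tree = build_dynamic_draft_tree(toy_draft, depth=3, expand_k=2, keep_m=4)
for path, value in tree:
    print(path, round(value, 3))
# Kept paths, in order: (1,), (1, 1), (1, 1, 1), (2,)
```

In this toy run the tree goes three tokens deep along the branch where the draft model is confident and keeps only a single token on the uncertain branch. A static tree would instead spend the same node budget in a fixed shape regardless of context.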

Code Implementations: 1 repo
