Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
This work targets decoding latency, a known bottleneck for LLM inference efficiency, and proposes a new method for reducing it.
The paper tackles the suboptimal performance of speculative decoding in LLM inference, which it attributes to a mismatch between dynamic speculation and static runtime assumptions, and reports up to 3.98x speedup over state-of-the-art baselines.
Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.
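To make the idea of a latency-aware objective concrete, the following is a minimal sketch and not Yggdrasil's actual formulation. It simplifies the draft tree to a single chain of `depth` tokens, assumes each draft token is accepted independently with probability `accept_prob`, and assumes the verification cost `verify_ms` does not depend on depth. All function and parameter names are hypothetical. Under these assumptions, the expected number of tokens emitted per verification step is the standard geometric sum $(1 - p^{d+1})/(1 - p)$, and the objective picks the depth that minimizes expected latency per emitted token rather than the depth that maximizes accepted tokens.

```python
def expected_tokens_per_step(accept_prob: float, depth: int) -> float:
    """Expected tokens emitted per verify step for a draft chain of `depth` tokens.

    Assumes i.i.d. acceptance with probability `accept_prob` (a simplification).
    The target model always contributes one token, so depth 0 yields 1.0.
    """
    if accept_prob >= 1.0:
        return float(depth + 1)
    # Geometric sum: 1 + p + p^2 + ... + p^depth
    return (1.0 - accept_prob ** (depth + 1)) / (1.0 - accept_prob)


def latency_per_token(depth: int, accept_prob: float,
                      draft_ms: float, verify_ms: float) -> float:
    """Expected wall-clock milliseconds per emitted token at a given draft depth."""
    # Assumed cost model: sequential drafting plus one fixed-cost verification.
    step_ms = depth * draft_ms + verify_ms
    return step_ms / expected_tokens_per_step(accept_prob, depth)


def best_depth(accept_prob: float, draft_ms: float, verify_ms: float,
               max_depth: int = 16) -> int:
    """Pick the draft depth in [0, max_depth] that minimizes latency per token."""
    return min(
        range(max_depth + 1),
        key=lambda d: latency_per_token(d, accept_prob, draft_ms, verify_ms),
    )


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    for p in (0.0, 0.5, 0.8, 0.95):
        d = best_depth(p, draft_ms=1.0, verify_ms=20.0)
        print(f"accept_prob={p:.2f} -> depth={d}, "
              f"ms/token={latency_per_token(d, p, 1.0, 20.0):.2f}")
```

The point of the sketch is that a deeper draft is not always better: each extra draft token adds cost but yields diminishing expected gains, so the best depth depends on both the acceptance rate and the relative cost of drafting versus verifying. A tree-structured draft, as in the paper, would replace the chain model with a per-node acceptance estimate.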