LG | AI | Feb 5

TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference

arXiv:2602.05145v1 | 1 citation | h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in LLM serving for real-world applications, representing an incremental improvement over existing speculative decoding methods.

The paper addresses the challenge of keeping speculative decoding effective for LLM inference under evolving workloads. It introduces TIDE, a framework that integrates online draft adaptation into serving engines, achieving up to 1.15x throughput improvement over static speculative decoding and reducing draft training time by 1.67x compared to approaches that recompute training signals.

Speculative decoding can substantially accelerate LLM inference, but realizing its benefits in practice is challenging due to evolving workloads and system-level constraints. We present TIDE (Temporal Incremental Draft Engine), a serving-engine-native framework that integrates online draft adaptation directly into high-performance LLM inference systems. TIDE reuses target model hidden states generated during inference as training signals, enabling zero-overhead draft adaptation without reloading the target model, and employs adaptive runtime control to activate speculation and training only when beneficial. TIDE exploits heterogeneous clusters by mapping decoupled inference and training to appropriate GPU classes. Across diverse real-world workloads, TIDE achieves up to 1.15x throughput improvement over static speculative decoding while reducing draft training time by 1.67x compared to approaches that recompute training signals.
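
The idea of activating speculation "only when beneficial" can be illustrated with a small sketch. This is not TIDE's actual controller, which the abstract does not specify. It is a minimal illustration that assumes the standard expected-speedup estimate for speculative decoding: with per-token acceptance rate alpha, draft length k, and draft-to-target cost ratio c, the estimate is (1 - alpha^(k+1)) / ((1 - alpha)(k*c + 1)). The names `SpeculationGate`, `expected_speedup`, `cost_ratio`, and `threshold` are hypothetical, and the rolling accepted/proposed ratio is only a crude stand-in for alpha.

```python
from collections import deque


def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup of speculative decoding over plain decoding.

    alpha: per-token acceptance rate of the draft model (0..1)
    k: number of draft tokens proposed per verification step
    cost_ratio: cost of one draft step relative to one target step (assumed)
    """
    if alpha >= 1.0:
        tokens_per_step = k + 1  # every draft token accepted, plus one bonus token
    else:
        tokens_per_step = (1 - alpha ** (k + 1)) / (1 - alpha)
    return tokens_per_step / (k * cost_ratio + 1)


class SpeculationGate:
    """Hypothetical runtime gate: speculate only if the estimate beats a threshold."""

    def __init__(self, k=4, cost_ratio=0.1, window=100, threshold=1.0):
        self.k = k
        self.cost_ratio = cost_ratio
        self.threshold = threshold
        # Rolling window of (accepted, proposed) counts per verification step.
        self.history = deque(maxlen=window)

    def record(self, accepted: int, proposed: int) -> None:
        self.history.append((accepted, proposed))

    def acceptance_rate(self) -> float:
        proposed = sum(p for _, p in self.history)
        if proposed == 0:
            return 0.0
        return sum(a for a, _ in self.history) / proposed

    def should_speculate(self) -> bool:
        if not self.history:
            return True  # no evidence yet: start optimistically (an assumption)
        estimate = expected_speedup(self.acceptance_rate(), self.k, self.cost_ratio)
        return estimate > self.threshold
```

In this sketch, a workload shift that lowers the draft model's acceptance rate pushes the estimate below the threshold and switches speculation off. A real serving engine would presumably also account for batch size and GPU utilization, which this illustration ignores.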
