CL | Feb 27, 2025

RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding

arXiv:2502.20330v2 | 9 citations | h-index: 7 | ICML
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges for users of long-context LLMs by integrating RAG with speculative decoding in a novel way, though it builds incrementally on existing methods.

The paper tackles the computational inefficiency of long-context inference in large language models by introducing Retrieval-Augmented Speculative Decoding (RAPID), which uses retrieval-augmented draft models to accelerate generation. It reports speedups of more than 2x alongside quality gains, for example an improvement from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B.

The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG to both accelerate long-context inference and enhance its generation quality. RAPID introduces the RAG drafter, a draft LLM operating on shortened retrieval contexts, to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm in which same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities of stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution with RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and levels of retrieval quality.
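To make the drafting idea concrete, here is a minimal sketch of the accept/reject rule used in standard speculative sampling, which is the mechanism the abstract builds on. It is not the paper's implementation: the function name `speculative_step` and the toy probability lists are hypothetical, and RAPID's inference-time knowledge transfer (which modifies the target distribution) is not shown because the abstract does not give its form. In a RAPID-style setup, `q_draft` would presumably come from the RAG drafter reading the shortened retrieved context, and `p_target` from the target LLM reading the full long context.

```python
import random


def speculative_step(p_target, q_draft, draft_token, rng):
    """One accept/reject step of standard speculative sampling (illustrative).

    p_target:    target-model probabilities over the vocabulary (assumed input).
    q_draft:     draft-model probabilities over the vocabulary (assumed input).
    draft_token: token index that was sampled from q_draft.
    rng:         a random.Random instance.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]  # > 0 because draft_token was sampled from q_draft

    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True

    # On rejection, resample from the renormalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)  # > 0 whenever a rejection is possible
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False


rng = random.Random(0)
# Identical distributions: the drafted token is always accepted.
same = speculative_step([0.2, 0.8], [0.2, 0.8], 1, rng)
# Target assigns zero mass to the drafted token: it is rejected and replaced.
fixed = speculative_step([1.0, 0.0], [0.5, 0.5], 1, rng)
```

The more closely the drafter's distribution matches the target's, the more drafted tokens are accepted per target forward pass, which is where the speedup in speculative decoding comes from.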

Code Implementations: 1 repo
