CL, AI | Jun 10, 2025

Draft-based Approximate Inference for LLMs

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

arXiv:2506.08373v29.64 citations | h-index: 5 | Has Code

Originality Highly original

AI Analysis

This work addresses the computational cost of long-context LLM inference. Its approach improves accuracy over prior approximation methods, though it builds incrementally on existing draft-model concepts.

The paper tackles the problem of optimizing inference for long-context LLMs by proposing a framework that uses small draft models to predict token and KV pair importance, resulting in methods like SpecKV and SpecPC that achieve higher accuracy than existing baselines while maintaining improvements in memory usage, latency, and throughput.

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, the first method that leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
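The core idea in the abstract, scoring prompt tokens by how much attention a draft model pays to them and keeping only the highest-scoring ones, can be illustrated with a small sketch. This is not the authors' implementation. The function names `token_importance` and `keep_top_k`, the max-over-queries aggregation, and the attention values are all assumptions made for illustration; the abstract does not specify how scores are aggregated or how many tokens are kept.

```python
from typing import List


def token_importance(attn: List[List[float]]) -> List[float]:
    """Collapse attention rows (one per draft query) into one score per prompt token.

    Assumption: the score is the maximum attention any query gives the token.
    """
    n_tokens = len(attn[0])
    return [max(row[j] for row in attn) for j in range(n_tokens)]


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical draft-model attention weights: 2 queries x 5 prompt tokens.
draft_attn = [
    [0.05, 0.60, 0.05, 0.25, 0.05],
    [0.10, 0.10, 0.05, 0.70, 0.05],
]

scores = token_importance(draft_attn)
kept = keep_top_k(scores, k=2)
print(kept)  # indices of the prompt tokens that would be retained
```

In this toy setting the retained indices would select either the prompt tokens passed to the target model (in the spirit of prompt compression) or the KV pairs kept in the cache (in the spirit of KV cache dropping). A real system would read the attention weights from an actual draft model rather than a hand-written matrix.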
