CL | AI | Feb 21, 2025

TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding

arXiv:2502.15197v2 | 7 citations | h-index: 18 | ACL
Originality: Incremental advance
AI Analysis

This incremental advance addresses fast inference in multi-request settings for service providers with limited inference capacity.

The paper tackles throughput optimization in batch speculative decoding for large language models. It proposes TETRIS, a method that selects the draft tokens most likely to be accepted, which reduces rejections and yields higher acceptance rates and more efficient resource utilization than baselines.

We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such effective resource utilization for fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.
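
The abstract does not spell out the selection rule, so the following is only a minimal sketch of one way to pick the most promising draft tokens across a batch under a fixed verification budget. It assumes that each draft token comes with an estimated acceptance probability (for example, the draft model's own confidence), that a token can only be accepted if every earlier draft token in the same request is accepted, and that the budget is a simple token count. The function name `select_draft_tokens` and the parameters `draft_probs` and `capacity` are hypothetical and are not taken from the paper.

```python
import heapq


def select_draft_tokens(draft_probs, capacity):
    """Greedily choose which draft tokens to send for verification.

    draft_probs[i][j] is an ASSUMED estimate of the probability that the
    j-th draft token of request i is accepted, given that all earlier
    draft tokens of that request were accepted.
    capacity is an ASSUMED budget: the total number of draft tokens the
    target model verifies per step across the whole batch.

    Returns a list giving, for each request, how many leading draft
    tokens are kept.
    """
    keep = [0] * len(draft_probs)

    # Max-heap (via negated scores) holding, for each request, the
    # cumulative acceptance estimate of its next unselected token.
    heap = []
    for i, probs in enumerate(draft_probs):
        if probs:
            heapq.heappush(heap, (-probs[0], i))

    selected = 0
    while heap and selected < capacity:
        neg_score, i = heapq.heappop(heap)
        keep[i] += 1
        selected += 1
        j = keep[i]
        if j < len(draft_probs[i]):
            # The next token is useful only if the whole prefix survives,
            # so its score is the running product of the estimates.
            heapq.heappush(heap, (neg_score * draft_probs[i][j], i))
    return keep


batch = [[0.9, 0.8, 0.1], [0.5, 0.4], [0.95, 0.9, 0.85]]
print(select_draft_tokens(batch, capacity=4))  # [1, 0, 3]
```

In this toy batch the budget goes to the requests whose drafts look most likely to survive verification, and the low-confidence request gets no draft tokens at all. Whether the paper uses exactly this scoring or budget model is not stated in the abstract.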

