CL, AI
Nov 8, 2024

SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

arXiv:2411.05289v1
29 citations
h-index: 6
Has Code
EMNLP
Originality: Highly original
AI Analysis

SpecHub addresses inference speed bottlenecks in LLMs for NLP applications, offering a practical improvement over existing heuristic approaches.

The paper tackles low acceptance rates in multi-draft speculative decoding for LLMs, which limit inference speed. It presents SpecHub, a method that improves acceptance rates with only linear computational overhead and generates 0.05-0.27 and 0.02-0.16 more tokens per step than Recursive Rejection Sampling (RRS) and RRS without replacement, respectively.

Large Language Models (LLMs) have become essential in advancing natural language processing (NLP) tasks, but their sequential token generation limits inference speed. Multi-Draft Speculative Decoding (MDSD) offers a promising solution by using a smaller draft model to generate multiple token sequences, which the target LLM verifies in parallel. However, current heuristic approaches, such as Recursive Rejection Sampling (RRS), suffer from low acceptance rates in subsequent drafts, limiting the advantages of using multiple drafts. Meanwhile, Optimal Transport with Membership Cost (OTM) can theoretically improve acceptance rates, but its computational cost is too high for real-time use. We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead. By simplifying the OTM problem into a compact Linear Programming model, SpecHub significantly reduces computational complexity. It further accelerates sampling by leveraging a sparse joint distribution, focusing computation on high-probability token sequences. In extensive experiments, SpecHub consistently generates 0.05-0.27 and 0.02-0.16 more tokens per step than RRS and RRS without replacement, respectively. Our code is available at https://github.com/MasterGodzilla/Speculative_decoding_OT.
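
To make the RRS baseline mentioned in the abstract concrete, here is a minimal sketch of recursive rejection sampling with replacement at a single token position. It is not SpecHub and is not taken from the linked repository. The function name `rrs_verify` and its arguments are hypothetical, and the sketch assumes the commonly described rule: accept a draft token x with probability min(1, p(x)/q(x)), and on rejection replace the target distribution p with the normalized residual max(p - q, 0) before testing the next draft.

```python
import numpy as np


def rrs_verify(p, q, drafts, rng):
    """Sketch of recursive rejection sampling (with replacement), one position.

    p      : target-model probabilities over the vocabulary (sums to 1)
    q      : draft-model probabilities over the vocabulary (sums to 1)
    drafts : token ids assumed to be sampled i.i.d. from q
    rng    : a numpy.random.Generator
    Returns (token_id, accepted), where accepted is False when every draft
    was rejected and the token was drawn from the final residual.
    """
    residual = np.asarray(p, dtype=float).copy()
    q = np.asarray(q, dtype=float)
    for x in drafts:
        # Accept the draft token with probability min(1, residual(x) / q(x)).
        if rng.random() < min(1.0, residual[x] / q[x]):
            return int(x), True
        # Rejected: keep only the target mass the draft model under-covers.
        residual = np.maximum(residual - q, 0.0)
        residual /= residual.sum()
    # All drafts rejected: fall back to sampling from the residual.
    return int(rng.choice(len(residual), p=residual)), False
```

Under these assumptions the returned token follows the target distribution p. Each rejection concentrates the residual on tokens the draft model rarely proposes, which is one way to see why later drafts are accepted less often.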
