IT · Apr 22

DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

arXiv:2604.20919 · 65.5 · h-index: 6

Predicted impact: top 4% in IT · last 90 days
Originality: Incremental advance

AI Analysis

Addresses throughput bottlenecks in multi-user edge LLM inference by jointly optimizing batching and draft lengths.

DiP-SD jointly optimizes batching, pipeline scheduling, and per-user draft token lengths for distributed speculative decoding in multi-user edge LLM inference, achieving up to 17.89x throughput over autoregressive decoding (AD) and 1.93x over AD with greedy batching.

Speculative decoding has emerged as a promising technique for large language model (LLM) inference, accelerating autoregressive decoding via a draft-then-verify procedure. This paper studies a new multi-user edge inference scenario in which draft tokens are generated locally on devices and then offloaded to a centralized edge server for batch verification. The key challenge is to sustain high throughput under coupled decisions on (i) batching and pipeline scheduling and (ii) per-user draft token length. We propose DiP-SD, which exploits two complementary parallelism dimensions: device-level distributed drafting and phase-level draft-verify pipelining. We formulate a throughput-maximization objective, defined as the expected number of accepted tokens per unit time, and jointly optimize the number of batches, the user-to-batch assignment, and the integer draft lengths. To solve the resulting fractional mixed-integer program, DiP-SD scans the batch number and iteratively alternates between an association subproblem and a draft-length subproblem. Numerical results under a Qwen3-1.7B/Qwen3-32B device-edge deployment show that DiP-SD achieves up to 17.89x throughput over autoregressive decoding (AD) and 1.93x over AD with greedy batching.
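
To make the throughput objective in the abstract concrete, here is a minimal single-user sketch in Python. It is not the DiP-SD algorithm: it ignores batching, user-to-batch assignment, and pipelining, and only illustrates why an integer draft length trades off expected accepted tokens against round time. All names (`expected_accepted`, `throughput`, `best_draft_len`) and the parameters `accept_rate`, `t_draft`, and `t_verify` are hypothetical. The sketch assumes each draft token is accepted independently with a fixed probability and that the verifier contributes one extra token per round, which is a common modeling assumption for speculative decoding and may differ from the paper's model.

```python
def expected_accepted(draft_len: int, accept_rate: float) -> float:
    """Expected tokens produced per draft-verify round.

    Assumption (not from the paper): each draft token is accepted
    independently with probability accept_rate, acceptance stops at the
    first rejection, and the verifier always adds one token of its own.
    This gives the geometric sum 1 + a + a^2 + ... + a^draft_len.
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def throughput(draft_len: int, accept_rate: float,
               t_draft: float, t_verify: float) -> float:
    """Expected tokens per unit time for one user without pipelining.

    Assumption: a round costs draft_len sequential draft steps of
    t_draft each, followed by one verification pass of t_verify.
    """
    round_time = draft_len * t_draft + t_verify
    return expected_accepted(draft_len, accept_rate) / round_time


def best_draft_len(accept_rate: float, t_draft: float,
                   t_verify: float, max_len: int = 16) -> int:
    """Scan integer draft lengths 1..max_len and return the best one."""
    return max(
        range(1, max_len + 1),
        key=lambda n: throughput(n, accept_rate, t_draft, t_verify),
    )


if __name__ == "__main__":
    # Hypothetical timings in arbitrary units; not measurements.
    n = best_draft_len(accept_rate=0.8, t_draft=1.0, t_verify=10.0)
    print(n, throughput(n, 0.8, 1.0, 10.0))
```

An exhaustive scan over a small integer range is used here only because the toy problem has a single variable. The paper's coupled problem over batch count, assignment, and per-user draft lengths is a fractional mixed-integer program, which is why it resorts to scanning the batch number and alternating between subproblems.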
