LG, Feb 17

MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

arXiv:2602.16052v1, 3 citations, h-index: 4
Originality: Incremental advance
AI Analysis

This paper addresses efficiency issues in LLM inference for MoE models, offering an incremental improvement to speculative decoding.

The paper tackles a bottleneck in speculative decoding for Mixture-of-Experts (MoE) models: large draft trees increase memory pressure and reduce speedups. It proposes MoE-Spec, a training-free expert budgeting method that enforces fixed expert capacity limits so that only key experts are loaded, yielding 10-30% higher throughput than state-of-the-art baselines at comparable quality.

Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10-30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
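
The per-layer expert budget described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `budget_experts`, the importance score (total gate weight an expert receives across all draft tokens), and the renormalization of surviving gate weights are all assumptions made for illustration, since the abstract does not specify how "contribute most" is measured.

```python
from collections import defaultdict


def budget_experts(routes, budget):
    """Keep at most `budget` experts for one MoE layer during verification.

    routes: one dict per draft token, mapping expert id -> gate weight.
    Returns (kept_expert_ids, pruned_routes).
    """
    # Assumed importance score: summed gate weight over the whole draft tree.
    score = defaultdict(float)
    for route in routes:
        for expert, weight in route.items():
            score[expert] += weight

    # Keep the highest-scoring experts; ties go to the lower expert id.
    ranked = sorted(score, key=lambda e: (-score[e], e))
    kept = set(ranked[:budget])

    pruned = []
    for route in routes:
        survivors = {e: w for e, w in route.items() if e in kept}
        total = sum(survivors.values())
        # Assumption: renormalize so each token's remaining weights sum to 1.
        # A token with no surviving expert gets an empty route.
        pruned.append({e: w / total for e, w in survivors.items()} if total > 0 else {})
    return kept, pruned


# Four draft tokens, each routed to two of five experts (toy numbers).
routes = [
    {0: 0.6, 1: 0.4},
    {0: 0.7, 2: 0.3},
    {1: 0.5, 3: 0.5},
    {0: 0.8, 4: 0.2},
]
kept, pruned = budget_experts(routes, budget=2)
```

In this toy example the draft tree touches five unique experts, but only experts 0 and 1 are kept under a budget of 2, so the number of expert weight sets to load no longer grows with the number of draft tokens. Tightening `budget` trades fidelity of the verification pass for lower memory traffic, which mirrors the accuracy and latency trade-off the abstract mentions.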

