CL AI Sep 26, 2025

Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding

Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou

arXiv:2509.22134v1, 3 citations, h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck for efficient inference in large language models, offering a practical solution that improves speedups, though it is incremental relative to existing speculative decoding techniques.

The paper tackles draft policy misalignment in speculative decoding for LLM inference by introducing Group Tree Optimization (GTO), which aligns training with decoding-time tree policies, resulting in a 7.4% increase in acceptance length and an additional 7.7% speedup over prior state-of-the-art methods across multiple benchmarks and models.

Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy draft path, while decoding follows a tree policy that re-ranks and verifies multiple branches. This draft policy misalignment limits achievable speedups. We introduce Group Tree Optimization (GTO), which aligns training with the decoding-time tree policy through two components: (i) Draft Tree Reward, a sampling-free objective equal to the expected acceptance length of the draft tree under the target model, directly measuring decoding performance; (ii) Group-based Draft Policy Training, a stable optimization scheme that contrasts trees from the current and a frozen reference draft model, forming debiased group-standardized advantages and applying a PPO-style surrogate along the longest accepted sequence for robust updates. We further prove that increasing our Draft Tree Reward improves acceptance length and speedup. Across dialogue (MT-Bench), code (HumanEval), and math (GSM8K) benchmarks and multiple LLMs (e.g., LLaMA-3.1-8B, LLaMA-3.3-70B, Vicuna-1.3-13B, DeepSeek-R1-Distill-LLaMA-8B), GTO increases acceptance length by 7.4% and yields an additional 7.7% speedup over prior state-of-the-art EAGLE-3. By bridging draft policy misalignment, GTO offers a practical, general solution for efficient LLM inference.
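The abstract describes the Draft Tree Reward as the expected acceptance length of a draft tree under the target model, computed without sampling. The sketch below is one plausible reading of that idea, not the paper's implementation. It assumes a drafted token is accepted only if every token on its path matches what the target model would sample, so a node is reached with probability equal to the product of the target model's conditional probabilities along its path. Because sibling tokens are distinct, the expected number of accepted drafted tokens is then the sum of these reach probabilities over all nodes. The `DraftNode` class, its `target_prob` field, and the example probabilities are hypothetical, and the bonus token that the target model usually contributes after verification is not counted.

```python
from dataclasses import dataclass, field


@dataclass
class DraftNode:
    # Hypothetical node type: one drafted token in the draft tree.
    # target_prob is assumed to be the target model's conditional
    # probability of this token given the tokens on the path above it.
    target_prob: float
    children: list = field(default_factory=list)


def expected_acceptance_length(nodes, path_prob=1.0):
    """Sum, over every node in the tree, the probability that the whole
    path from the root down to that node is accepted.

    Assumption: a drafted token is accepted iff it matches the target
    model's sampled token, so siblings are mutually exclusive events.
    """
    total = 0.0
    for node in nodes:
        reach = path_prob * node.target_prob  # P(path to this node accepted)
        total += reach
        total += expected_acceptance_length(node.children, reach)
    return total


# Illustrative tree: two first-level candidates, the first with two children.
tree = [
    DraftNode(0.6, [DraftNode(0.5), DraftNode(0.2)]),
    DraftNode(0.3),
]
print(expected_acceptance_length(tree))
```

For this toy tree the sum is 0.6 + 0.3 + 0.6 * 0.5 + 0.6 * 0.2, or about 1.32 drafted tokens accepted per verification step. The quantity is a deterministic function of the target probabilities, which is consistent with the abstract's "sampling-free" description under the stated assumption.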
