SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache
SRT addresses the computational bottleneck of rollout generation in RL training for language models and offers a practical speedup for researchers and practitioners, though the contribution appears incremental because it builds on existing speculative decoding techniques.
The paper tackles slow reinforcement learning for language models by introducing SRT, a method that accelerates on-policy RL by using a tree-structured cache as the draft source for speculative decoding, achieving up to a 2.08x wall-clock speedup during rollout.
We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (e.g., PPO, GRPO and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.
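The idea of drafting from a per-prompt tree can be illustrated with a minimal sketch. It is not the authors' implementation: the names `TokenTrie`, `draft`, `speculative_step`, and `policy_probs` are hypothetical, the draft rule (follow the most frequently seen child) is an assumption, and the verification step uses the standard speculative-sampling acceptance rule specialized to a deterministic draft, which is one known way to keep the output distribution equal to the policy's.

```python
import random


class TokenTrie:
    """Hypothetical per-prompt prefix tree of previously generated token sequences."""

    def __init__(self):
        self.children = {}  # token id -> TokenTrie
        self.count = 0      # how many cached rollouts passed through this node

    def insert(self, tokens):
        """Add one rollout (a list of token ids) to the tree."""
        node = self
        for t in tokens:
            node = node.children.setdefault(t, TokenTrie())
            node.count += 1

    def draft(self, prefix, max_len):
        """Propose up to max_len tokens after prefix by following the
        most frequently seen child at each node (an assumed heuristic)."""
        node = self
        for t in prefix:
            node = node.children.get(t)
            if node is None:
                return []  # prefix was never cached: nothing to propose
        out = []
        while node.children and len(out) < max_len:
            t, node = max(node.children.items(), key=lambda kv: kv[1].count)
            out.append(t)
        return out


def speculative_step(policy_probs, prefix, draft, rng):
    """Verify draft tokens against the current policy.

    policy_probs(tokens) is assumed to return a dict {token: probability}
    for the next token. A real system would score all draft positions in
    one forward pass; the sequential calls here are only for clarity.
    """
    out = []
    for t in draft:
        p = policy_probs(prefix + out)
        # The draft puts probability 1 on t, so the acceptance
        # probability min(1, p(t) / q(t)) reduces to p(t).
        if rng.random() < p.get(t, 0.0):
            out.append(t)
            continue
        # On rejection, resample from the residual: p with t removed,
        # renormalized (rng.choices normalizes the weights).
        residual = {k: v for k, v in p.items() if k != t and v > 0}
        out.append(rng.choices(list(residual), weights=list(residual.values()))[0])
        break
    return out


# Usage: cache three earlier rollouts for one prompt, then draft from the tree.
cache = TokenTrie()
for rollout in ([1, 2, 3], [1, 2, 4], [1, 2, 3]):
    cache.insert(rollout)
proposal = cache.draft([1], max_len=5)  # follows the majority branch: [2, 3]
```

With a one-hot draft, a token x is emitted with probability p(x) by acceptance, and any other token y with probability (1 - p(x)) * p(y) / (1 - p(x)) = p(y) by resampling, so each emitted token is still distributed according to the current policy. The speedup then depends on how often the cached branch matches what the policy would have generated.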