Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail
This work addresses a fundamental stability problem in LLM-RL training and offers a practical solution for researchers and practitioners, though the contribution is incremental because it builds on existing RL methods.
The paper tackles the training-inference mismatch in LLM reinforcement learning. It proves that the bound on the resulting log-probability divergence scales as $(1-p)$ in the token probability $p$, so low-probability tail tokens carry the largest and most destabilizing errors. It then proposes Dynamic Vocabulary Pruning (DVP), which constrains the RL objective to a safe vocabulary and achieves stable training with bounded bias.
Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a training-inference mismatch. We prove that the bound on the log-probability divergence arising from this mismatch scales as $(1-p)$, where $p$ is the token probability. This scaling induces a highly asymmetric effect: the bound vanishes for high-probability tokens but remains significant for low-probability tokens in the distribution tail. When sampled, these tail tokens introduce systematically biased errors that accumulate over sequences, thereby destabilizing gradient estimation. Instead of applying post-hoc corrections, we propose Dynamic Vocabulary Pruning (DVP), which constrains the RL objective to a dynamically determined "safe" vocabulary that excludes the extreme tail. This strategy trades large, destabilizing numerical errors for a small, bounded optimization bias. We validate DVP empirically by demonstrating stable training, and theoretically by deriving strict bounds on the induced bias.
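The asymmetry and the pruning step can be illustrated with a small sketch. It rests on assumptions that are not taken from the paper: the mismatch is modeled as a logit perturbation of at most `eps` per entry, and the safe vocabulary is chosen by a hypothetical min-p-style rule (keep tokens whose probability is at least `rho` times the largest probability). The function names, `eps`, and `rho` are illustrative only. Under this perturbation model, raising one logit by `eps` and lowering all others by `eps` changes that token's log-probability by at most $2\,\mathrm{eps}\,(1-p)$, which is one elementary way to see a $(1-p)$ factor and is not necessarily the paper's derivation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def worst_case_increase(logits, i, eps):
    """Change in log p_i when logit i rises by eps and every other logit falls by eps.

    Assumed perturbation model (not from the paper): each logit moves by at most eps.
    """
    perturbed = [z + eps if j == i else z - eps for j, z in enumerate(logits)]
    return log_softmax(perturbed)[i] - log_softmax(logits)[i]


def safe_vocabulary(logits, rho):
    """Indices of tokens with p >= rho * max(p), for 0 < rho <= 1.

    Hypothetical min-p-style selection rule, used here only for illustration.
    """
    logp = log_softmax(logits)
    threshold = max(logp) + math.log(rho)
    return [i for i, lp in enumerate(logp) if lp >= threshold]


def pruned_log_softmax(logits, keep):
    """Log-probabilities renormalized over the kept tokens; pruned tokens get -inf."""
    kept = set(keep)
    masked = [z if i in kept else -math.inf for i, z in enumerate(logits)]
    return log_softmax(masked)


if __name__ == "__main__":
    logits = [5.0, 2.0, 0.0, -4.0, -6.0]  # toy 5-token vocabulary
    eps = 1e-3
    logp = log_softmax(logits)
    for i, lp in enumerate(logp):
        p = math.exp(lp)
        shift = worst_case_increase(logits, i, eps)
        print(f"token {i}: p={p:.2e}  shift={shift:.2e}  2*eps*(1-p)={2 * eps * (1 - p):.2e}")

    keep = safe_vocabulary(logits, rho=1e-3)
    pruned = pruned_log_softmax(logits, keep)
    removed_mass = 1.0 - sum(math.exp(logp[i]) for i in keep)
    print("kept tokens:", keep)
    print("removed probability mass:", removed_mass)
    print("log-prob shift of kept tokens:", pruned[keep[0]] - logp[keep[0]])
```

In this sketch, renormalizing over the kept tokens shifts each of their log-probabilities by the same amount, $-\log(1 - m)$, where $m$ is the removed tail mass. The bias introduced by pruning is therefore small whenever the pruned tail is small, while the tokens with the largest worst-case shifts are never scored.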