LG · AI · ML · Dec 28, 2025

Dynamic Vocabulary Pruning: Stable LLM-RL by Taming the Tail

arXiv:2512.23087v2 · 1 citation · h-index: 3
Originality: Incremental advance
AI Analysis

This paper addresses a fundamental stability problem in LLM-RL training and offers a solution for researchers and practitioners. It is rated incremental because it builds on existing RL methods.

The paper tackles the training-inference mismatch in LLM reinforcement learning. It proves that the bound on the log-probability divergence scales as $(1-p)$ in the token probability $p$, so the destabilizing errors come from low-probability tail tokens. It then proposes Dynamic Vocabulary Pruning, which constrains the RL objective to a safe vocabulary and achieves stable training with bounded bias.

Reinforcement Learning (RL) for Large Language Models (LLMs) faces a fundamental tension: the numerical divergence between high-throughput inference engines and numerically precise training engines. Although these systems share the same parameters, they produce slightly different probability distributions, creating a training-inference mismatch. We prove that the bound on the log-probability divergence arising from this mismatch scales as $(1-p)$, where $p$ is the token probability. This scaling induces a highly asymmetric effect: the bound vanishes for high-probability tokens but remains significant for low-probability tokens in the distribution tail. When sampled, these tail tokens introduce systematically biased errors that accumulate over sequences, thereby destabilizing gradient estimation. Instead of applying post-hoc corrections, we propose Dynamic Vocabulary Pruning (DVP), which constrains the RL objective to a dynamically determined "safe" vocabulary that excludes the extreme tail. This strategy trades large, destabilizing numerical errors for a small, bounded optimization bias. We validate DVP empirically by demonstrating stable training, and theoretically by deriving strict bounds on the induced bias.
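
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it makes two assumptions. First, the engine mismatch is modeled as a perturbation of the logits with every entry bounded by `delta`; under that model an elementary softmax argument gives |Δ log p_i| ≤ (1 − p_i)(e^{2·delta} − 1), which has the same $(1-p)$ shape as the paper's bound but is not claimed to be its theorem. Second, the abstract does not say how the safe vocabulary is chosen, so the threshold rule below (keep tokens whose probability is at least `rho` times the maximum) and the names `safe_vocab_mask`, `pruned_log_probs`, and `rho` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def perturbation_shift(logits, eps):
    """Per-token change in log-probability when logits are shifted by eps.

    Assumption: the training-inference mismatch is modeled as a small
    additive perturbation of the logits.
    """
    base = log_softmax(logits)
    pert = log_softmax([z + e for z, e in zip(logits, eps)])
    return [b - a for a, b in zip(base, pert)]


def safe_vocab_mask(logits, rho=0.01):
    """Hypothetical pruning rule: keep token i iff p_i >= rho * max_j p_j.

    The paper's actual selection rule is not given in the abstract.
    """
    logp = log_softmax(logits)
    cutoff = max(logp) + math.log(rho)
    return [lp >= cutoff for lp in logp]


def pruned_log_probs(logits, mask):
    """Renormalize over kept tokens; pruned tokens get log-probability -inf."""
    kept = [z for z, keep in zip(logits, mask) if keep]
    m = max(kept)
    lse = m + math.log(sum(math.exp(z - m) for z in kept))
    return [z - lse if keep else float("-inf") for z, keep in zip(logits, mask)]


if __name__ == "__main__":
    logits = [5.0, 2.0, 0.0, -4.0]       # toy 4-token vocabulary
    delta = 0.01                          # assumed max logit mismatch
    eps = [delta, -delta, delta, -delta]  # one perturbation within the bound

    probs = [math.exp(lp) for lp in log_softmax(logits)]
    shifts = perturbation_shift(logits, eps)
    for p, s in zip(probs, shifts):
        bound = (1.0 - p) * (math.exp(2 * delta) - 1.0)
        print(f"p={p:.5f}  |shift|={abs(s):.6f}  bound={bound:.6f}")

    mask = safe_vocab_mask(logits, rho=0.01)
    print("kept tokens:", mask)
    print("pruned log-probs:", pruned_log_probs(logits, mask))
```

In this toy setup the bound is near zero for the dominant token and close to its maximum for the tail tokens, which are the ones the mask drops. Renormalizing over the kept tokens shifts every kept log-probability by minus the log of the kept probability mass; this is the small bias that pruning accepts in exchange for removing the tail.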

