Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu

arXiv:2605.14220 · 77.8

Predicted impact: top 17% in LG · last 90 days
Originality: Incremental advance

AI Analysis

For researchers and engineers working on LLM RL systems, this work highlights a previously overlooked systems-level perturbation that can destabilize training, offering diagnostic tools and mitigation strategies.

The paper identifies Training-Inference Mismatch (TIM) in LLM reinforcement learning, where token probabilities differ between rollout generation and policy optimization due to implementation differences, and shows that small numerical disagreements can cause training collapse. The authors propose a diagnostic setting (VeXact) and identify remedies to mitigate TIM.

Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
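To make the notion of a token-level disagreement concrete, the following is a minimal sketch and not the authors' VeXact setup. It assumes two hypothetical numeric paths over identical synthetic logits: a float16 log-softmax standing in for the rollout engine and a float64 log-softmax standing in for the trainer. The function name `token_logprobs`, the sizes, and the dtypes are illustrative choices only.

```python
import numpy as np

def token_logprobs(logits, tokens, dtype):
    """Log-probability of each chosen token, with log-softmax computed in `dtype`."""
    x = logits.astype(dtype)
    x = x - x.max(axis=-1, keepdims=True)            # shift for numerical stability
    logz = np.log(np.exp(x).sum(axis=-1, keepdims=True))
    logp = (x - logz).astype(np.float64)             # upcast only for comparison
    return np.take_along_axis(logp, tokens[:, None], axis=-1)[:, 0]

rng = np.random.default_rng(0)
seq_len, vocab = 512, 1000                           # illustrative sizes
logits = rng.normal(scale=4.0, size=(seq_len, vocab))  # synthetic logits, shared by both paths
tokens = rng.integers(0, vocab, size=seq_len)        # synthetic sampled tokens

# Assumption: float16 stands in for the rollout path, float64 for the training path.
logp_rollout = token_logprobs(logits, tokens, np.float16)
logp_train = token_logprobs(logits, tokens, np.float64)

token_gap = logp_train - logp_rollout                # per-token log-prob disagreement
seq_log_ratio = token_gap.sum()                      # log of the sequence-level probability ratio

print("max |per-token gap|:", np.abs(token_gap).max())
print("sequence-level log ratio:", seq_log_ratio)
print("sequence-level ratio:", np.exp(seq_log_ratio))
```

Both paths see the same inputs, so any nonzero gap here comes purely from the numerics. The per-token gaps are small, but they accumulate over the sequence, which is one way a ratio that should equal 1 can drift away from it.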
