AI LG · Jan 30

A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

arXiv:2601.22718v14.42 citations · h-index: 6

Originality: Incremental advance

AI Analysis

This work addresses training efficiency and stability for researchers and practitioners who use RL to fine-tune LLMs. It is an incremental advance because it builds on existing off-policy correction methods.

The paper addresses unstable training dynamics in reinforcement learning post-training of large language models under off-policy conditions. Its proposed MinPRO objective substantially improves training stability and peak performance across multiple mathematical reasoning benchmarks.

Reinforcement learning (RL) post-training has increasingly demonstrated strong ability to elicit reasoning behaviors in large language models (LLMs). For training efficiency, rollouts are typically generated in an off-policy manner using an older sampling policy and then used to update the current target policy. To correct the resulting discrepancy between the sampling and target policies, most existing RL objectives rely on a token-level importance sampling ratio, primarily due to its computational simplicity and numerical stability. However, we observe that token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. In this paper, we revisit LLM policy optimization under off-policy conditions and show that the theoretically rigorous correction term is the prefix importance ratio, and that relaxing it to a token-level approximation can induce instability in RL post-training. To stabilize LLM optimization under large off-policy drift, we propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO). MinPRO replaces the unstable cumulative prefix ratio with a non-cumulative surrogate based on the minimum token-level ratio observed in the preceding prefix. Extensive experiments on both dense and mixture-of-experts LLMs, across multiple mathematical reasoning benchmarks, demonstrate that MinPRO substantially improves training stability and peak performance in off-policy regimes.
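
The three correction terms named in the abstract can be compared on a toy sequence. The sketch below is illustrative only and is not the paper's implementation: the function name `importance_ratios` and its inputs are hypothetical, and it assumes the minimum is taken over tokens up to and including the current position (the abstract says "preceding prefix", so the exact indexing may differ). It also does not show how MinPRO uses the ratio inside its objective, which the abstract does not specify.

```python
import numpy as np

def importance_ratios(logp_target, logp_sampling):
    """Compare three off-policy correction terms for one sampled sequence.

    logp_target[t]   : log-prob of token t under the current target policy
    logp_sampling[t] : log-prob of token t under the older sampling policy
    (Hypothetical helper; names and indexing are assumptions.)
    """
    logp_target = np.asarray(logp_target, dtype=float)
    logp_sampling = np.asarray(logp_sampling, dtype=float)
    log_ratio = logp_target - logp_sampling

    # Token-level ratio: uses only the current token.
    token_ratio = np.exp(log_ratio)

    # Prefix ratio: cumulative product of token ratios up to position t,
    # computed as a cumulative sum in log space.
    prefix_ratio = np.exp(np.cumsum(log_ratio))

    # Non-cumulative surrogate: running minimum of the token ratios.
    # ASSUMPTION: the minimum includes position t itself.
    min_prefix_ratio = np.minimum.accumulate(token_ratio)

    return token_ratio, prefix_ratio, min_prefix_ratio


# Toy example: per-token probabilities under each policy.
target_probs = np.array([0.5, 0.2, 0.4])
sampling_probs = np.array([0.25, 0.4, 0.4])
token_r, prefix_r, min_r = importance_ratios(np.log(target_probs),
                                             np.log(sampling_probs))
```

Because the prefix ratio is a product over all earlier tokens, it can grow or shrink multiplicatively with sequence length, whereas the running minimum can never exceed the smallest token ratio seen so far. This is one way to read the abstract's description of the surrogate as "non-cumulative".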
