VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
VESPO addresses a critical stability problem for researchers and practitioners training LLMs with RL, though it is an incremental improvement over existing methods.
The paper tackles training instability in reinforcement learning for large language models, which arises when the behavior policy diverges from the current policy. It proposes VESPO, a method that maintains stable training under staleness ratios up to 64x and fully asynchronous execution, with consistent gains on mathematical reasoning benchmarks.
Training stability remains a central challenge in reinforcement learning (RL) for large language models (LLMs). Policy staleness, asynchronous training, and mismatches between training and inference engines all cause the behavior policy to diverge from the current policy, risking training collapse. Importance sampling provides a principled correction for this distribution shift but suffers from high variance; existing remedies such as token-level clipping and sequence-level normalization lack a unified theoretical foundation. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By incorporating variance reduction into a variational formulation over proposal distributions, VESPO derives a closed-form reshaping kernel that operates directly on sequence-level importance weights without length normalization. Experiments on mathematical reasoning benchmarks show that VESPO maintains stable training under staleness ratios up to 64x and fully asynchronous execution, and delivers consistent gains across both dense and Mixture-of-Experts models. Code is available at https://github.com/FloyedShen/VESPO
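To make the variance problem concrete, the following minimal sketch contrasts a raw sequence-level importance weight with the two existing remedies the abstract mentions, length normalization and token-level clipping. It is not the VESPO reshaping kernel, which the abstract does not specify; the function names, the clipping range `eps=0.2`, and the toy log-probabilities are all assumptions made for illustration.

```python
import math

def sequence_log_ratio(logp_current, logp_behavior):
    """Sum of per-token log-ratios log pi_current(y_t) - log pi_behavior(y_t)."""
    if len(logp_current) != len(logp_behavior) or not logp_current:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(c - b for c, b in zip(logp_current, logp_behavior))

def sequence_weight(logp_current, logp_behavior):
    """Raw sequence-level importance weight: the product of token ratios."""
    return math.exp(sequence_log_ratio(logp_current, logp_behavior))

def length_normalized_weight(logp_current, logp_behavior):
    """Geometric mean of the token ratios (sequence-level normalization)."""
    n = len(logp_current)
    return math.exp(sequence_log_ratio(logp_current, logp_behavior) / n)

def clipped_token_ratios(logp_current, logp_behavior, eps=0.2):
    """Per-token ratios clipped to [1 - eps, 1 + eps] (token-level clipping)."""
    return [
        min(max(math.exp(c - b), 1.0 - eps), 1.0 + eps)
        for c, b in zip(logp_current, logp_behavior)
    ]

# Toy data (hypothetical): a 100-token response where the current policy
# assigns each token a log-probability 0.05 higher than the behavior policy.
behavior = [-1.0] * 100
current = [-0.95] * 100

raw = sequence_weight(current, behavior)            # about exp(5)
normed = length_normalized_weight(current, behavior)  # about exp(0.05)
clipped = clipped_token_ratios(current, behavior)
```

In this toy case a small per-token shift compounds into a raw sequence weight of roughly exp(5), while the length-normalized weight stays near exp(0.05) and every clipped token ratio stays inside the clipping range. That exponential growth with sequence length is the source of the high variance that the abstract attributes to importance sampling.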