LG · AI · Mar 7, 2025

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

arXiv:2503.05453v1 · 8 citations · h-index: 16
Originality: Incremental advance
AI Analysis

This paper addresses sample inefficiency and limited exploration in RL for language models, offering a scalable solution for researchers and practitioners, though it is an incremental improvement over existing methods.

The paper tackles the sample inefficiency and exploration difficulties in RL-based post-training of language models by introducing Soft Policy Optimization (SPO), a method that learns from arbitrary online and offline trajectories without a separate value model. In experiments on code contests, SPO outperforms PPO on pass@10, is faster, more memory efficient, and learns more diverse policies.

RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or other policies, or by decoding and exploration methods. This results in severe sample inefficiency and exploration difficulties, as well as a potential loss of diversity in the policy responses. Moreover, asynchronous PPO implementations require frequent and costly model transfers, and typically use value models which require a large amount of memory. In this paper we introduce Soft Policy Optimization (SPO), a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories and does not require a separate value model. In experiments on code contests, we show that SPO outperforms PPO on pass@10, is significantly faster and more memory efficient, is able to benefit from off-policy data, enjoys improved stability, and learns more diverse (i.e. soft) policies.
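The abstract reports results on pass@10 without defining the metric. As a rough illustration only, the sketch below computes the widely used unbiased pass@k estimator: given n sampled solutions to a problem, of which c pass the tests, it estimates the probability that at least one of k samples is correct. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); the paper's exact evaluation
    protocol is not stated in the abstract.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 20 samples for one problem, 1 of them correct.
    print(pass_at_k(n=20, c=1, k=10))
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems. Because pass@10 rewards any one of ten samples being correct, it tends to favour policies that keep their outputs diverse, which fits the abstract's emphasis on soft policies.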

