LG AI Mar 7, 2025

Soft Policy Optimization: Online Off-Policy RL for Sequence Models

Taco Cohen, David W. Zhang, Kunhao Zheng, Yunhao Tang, Remi Munos, Gabriel Synnaeve

arXiv:2503.05453v1 8 citations h-index: 16

Originality: Incremental advance

AI Analysis

The paper addresses the sample inefficiency and limited exploration of RL for language models and offers a scalable solution for researchers and practitioners, though it is an incremental improvement over existing methods.

The paper tackles the sample inefficiency and exploration difficulties in RL-based post-training of language models by introducing Soft Policy Optimization (SPO), a method that learns from arbitrary online and offline trajectories without a separate value model. In experiments on code contests, SPO outperforms PPO on pass@10, is faster, more memory efficient, and learns more diverse policies.

RL-based post-training of language models is almost exclusively done using on-policy methods such as PPO. These methods cannot learn from arbitrary sequences such as those produced earlier in training, in earlier runs, by human experts or other policies, or by decoding and exploration methods. This results in severe sample inefficiency and exploration difficulties, as well as a potential loss of diversity in the policy responses. Moreover, asynchronous PPO implementations require frequent and costly model transfers, and typically use value models which require a large amount of memory. In this paper we introduce Soft Policy Optimization (SPO), a simple, scalable and principled Soft RL method for sequence model policies that can learn from arbitrary online and offline trajectories and does not require a separate value model. In experiments on code contests, we show that SPO outperforms PPO on pass@10, is significantly faster and more memory efficient, is able to benefit from off-policy data, enjoys improved stability, and learns more diverse (i.e. soft) policies.
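The abstract reports results on pass@10 without saying how the metric is computed. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k samples is correct), the computation could look like the following. The function name `pass_at_k` and its parameters are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n generated samples, c of which are correct.

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute pass@10 differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect samples, every size-k subset
    # contains at least one correct sample.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts for one problem: 100 samples, 5 correct.
    print(pass_at_k(n=100, c=5, k=10))
```

Under this definition a problem counts as solved if any one of the k attempts is correct, and the benchmark score is the average of the per-problem estimates.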
