LG CL Jun 25, 2025

Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards

Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos

arXiv:2506.20520v1 | 28.125 citations | h-index: 12

Originality: Incremental advance

AI Analysis

This work targets efficient and effective alignment of large language models, a problem relevant to both researchers and practitioners. It is an incremental advance because it builds on existing off-policy methods.

The paper tackles the suboptimal performance of off-policy reinforcement learning when aligning large language models. It analyzes an off-policy REINFORCE algorithm with a tunable baseline, shows theoretically that emphasizing positive rewards over negative ones improves the policy, and validates this with a controlled bandit experiment and with LLM fine-tuning experiments.

Reinforcement learning (RL) is increasingly used to align large language models (LLMs). Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.
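To make the role of the baseline concrete, here is a minimal sketch of the off-policy REINFORCE update with advantage $A=r-V$ on a toy bandit. It is an illustration under stated assumptions, not the authors' implementation: the softmax parameterization, the use of expected rather than sampled gradients, the three-armed bandit, and the names `expected_gradient` and `train` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()


def expected_gradient(theta, mu, rewards, baseline):
    """Expected off-policy REINFORCE gradient for a softmax bandit policy.

    Actions come from the fixed behavior policy `mu`, and each action is
    weighted by the advantage A = r - V. For softmax logits,
    grad log pi(a) = onehot(a) - pi, so the expectation has a closed form.
    """
    pi = softmax(theta)
    weights = mu * (rewards - baseline)  # mu(a) * A(a) for every arm
    return weights - pi * weights.sum()


def train(mu, rewards, baseline, steps=2000, lr=0.5):
    """Gradient ascent on the logits, starting from the uniform policy."""
    theta = np.zeros_like(rewards, dtype=float)
    for _ in range(steps):
        theta += lr * expected_gradient(theta, mu, rewards, baseline)
    return softmax(theta)


# Hypothetical 3-armed bandit: illustrative numbers, not taken from the paper.
mu = np.array([0.5, 0.3, 0.2])       # behavior policy that generated the data
rewards = np.array([0.0, 0.5, 1.0])  # expected reward of each arm

pi_low = train(mu, rewards, baseline=-1.0)  # V below every reward
print("behavior policy reward:", mu @ rewards)
print("learned policy reward: ", pi_low @ rewards)
```

In this toy setting, a baseline below every reward makes all advantages positive, and the update settles at a policy proportional to `mu * (rewards - baseline)`, a reward-weighted reweighting of the behavior data. Changing the `baseline` argument is a simple way to explore how raising $V$ shifts the update toward penalizing low-reward arms.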

