LG AI CL | Jul 21, 2025

Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Johannes Ackermann, Takashi Ishida, Masashi Sugiyama

arXiv:2507.15507v1 | 13.05 citations | h-index: 8 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses overoptimization, a key bottleneck in using RLHF to align language models with human preferences. It is an incremental method that builds on existing RLHF frameworks.

The paper tackles overoptimization in Reinforcement Learning from Human Feedback (RLHF) for language models, where the reward model becomes inaccurate as the policy's responses drift away from the reward model's training distribution. It proposes Off-Policy Corrected Reward Modeling (OCRM), which corrects the reward model with importance weighting and needs no new labels or samples. The authors report significantly better performance than standard RLHF methods and baselines on summarization and chatbot datasets.

Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
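The idea of off-policy correcting a reward model with importance weighting can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that). It assumes a Bradley-Terry pairwise loss, and it assumes that each preference pair is weighted by the ratio of the current policy's probability of both responses to the probability under the policy that originally sampled them. The function names `importance_weights` and `weighted_bt_loss`, the self-normalization, and the clipping threshold are all hypothetical choices made for illustration.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)) for a scalar z.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def importance_weights(logp_new, logp_old, clip=10.0):
    """Per-pair importance weights (illustrative assumption, not the paper's exact form).

    logp_new / logp_old: lists of (chosen, rejected) sequence log-probabilities
    under the current policy and under the policy that sampled the pairs.
    """
    # Log of the density ratio for the whole pair (both responses).
    log_w = [(nc + nr) - (oc + orj)
             for (nc, nr), (oc, orj) in zip(logp_new, logp_old)]
    # Subtract the max before exponentiating to avoid overflow.
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    # Rescale to mean 1, then clip large weights to limit variance.
    mean_w = sum(w) / len(w)
    return [min(x / mean_w, clip) for x in w]


def weighted_bt_loss(r_chosen, r_rejected, weights):
    # Bradley-Terry negative log-likelihood per pair, averaged with the weights.
    losses = [-log_sigmoid(c - r) for c, r in zip(r_chosen, r_rejected)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

When the current policy equals the sampling policy, every weight is 1 and the loss reduces to the ordinary unweighted reward-model loss. As the policy drifts, pairs that the current policy is more likely to generate receive more weight, without any new human labels.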
