LG · Dec 17, 2023

Policy Optimization in RLHF: The Impact of Out-of-preference Data

arXiv:2312.10584v2 · 40 citations · h-index: 12
Originality: Incremental advance
AI Analysis

This paper addresses policy optimization in reinforcement learning from human feedback (RLHF) for AI alignment. The advance is incremental: it builds on existing methods such as DPO and RMB-PO.

The paper studies how to align intelligent agents with human preferences by comparing Direct Preference Optimization (DPO) with Reward-Model-Based Policy Optimization (RMB-PO) and its variant RMB-PO+. In controlled, synthetic experiments, DPO performs poorly and RMB-PO+ performs best. The gain is attributed to out-of-preference data, which lets policy optimization exploit the reward model's ability to generalize.

Aligning intelligent agents with human preferences and values is important. This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO). A variant of RMB-PO, referred to as RMB-PO+, is also considered. These methods, either explicitly or implicitly, learn a reward model from preference data and differ in the data used for policy optimization to unlock the generalization ability of the reward model. In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data. We examine the impact of such out-of-preference data. Our study, conducted through controlled and synthetic experiments, demonstrates that DPO performs poorly, whereas RMB-PO+ performs the best. In particular, even when providing the policy model with a good feature representation, we find that policy optimization with adequate out-of-preference data significantly improves performance by harnessing the reward model's generalization capabilities.
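To make the data distinction in the abstract concrete, here is a minimal tabular sketch. It is not the paper's implementation. `dpo_loss` is the standard DPO loss for a single preference pair, which only ever touches actions that appear in the preference data. `regularized_value` is a common KL-regularized RLHF objective, assumed here as a stand-in for the reward-model-based objective, with the expectation over policy-generated actions computed exactly. `states_for` is a hypothetical helper encoding the abstract's description: RMB-PO optimizes on states from the preference data, while RMB-PO+ adds new, preference-free states. All function names, the `beta` default, and the tabular policy representation are assumptions made for illustration.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair (w = chosen, l = rejected).

    Only actions that appear in the preference data enter this loss.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = softplus(-margin), in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def regularized_value(policy, ref_policy, reward, states, beta=0.1):
    """Mean over `states` of E_{a~policy}[reward(s, a)] - beta * KL(policy || ref).

    policy, ref_policy: dict mapping state -> list of action probabilities.
    reward: callable (state, action_index) -> float, standing in for a
            learned reward model (an assumption of this sketch).
    """
    total = 0.0
    for s in states:
        p, q = policy[s], ref_policy[s]
        expected_reward = sum(pa * reward(s, a) for a, pa in enumerate(p))
        kl = sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)
        total += expected_reward - beta * kl
    return total / len(states)

def states_for(method, pref_states, new_states):
    """States used for policy optimization, following the abstract's description."""
    if method == "RMB-PO":
        return list(pref_states)                     # preference-data states only
    if method == "RMB-PO+":
        return list(pref_states) + list(new_states)  # plus preference-free states
    raise ValueError(f"unknown method: {method}")
```

In this sketch the only difference between RMB-PO and RMB-PO+ is the `states` argument passed to `regularized_value`, which is where the reward model is queried on data outside the preference set.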

Code Implementations: 1 repo
