CL, AI, LG | Jan 22, 2024

West-of-N: Synthetic Preferences for Self-Improving Reward Models

arXiv:2401.12086v2 | 23 citations | h-index: 35
Originality: Incremental advance
AI Analysis

This paper addresses a key bottleneck in language model alignment, the quality of the reward model, but the advance is incremental because it builds on existing Best-of-N sampling strategies.

The paper improves reward model quality in reinforcement learning from human feedback (RLHF) by generating synthetic preference data, and reports gains comparable to adding a similar quantity of human preference data.

The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.
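
The pair-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the pool is sampled from the current policy and scored by the current reward model, and the names `west_of_n_pair`, `sample_response`, and `reward` are hypothetical placeholders introduced here for illustration.

```python
from typing import Callable, Tuple


def west_of_n_pair(
    query: str,
    sample_response: Callable[[str], str],   # hypothetical: draws one response from the policy
    reward: Callable[[str, str], float],     # hypothetical: reward model score for (query, response)
    n: int = 8,
) -> Tuple[str, str]:
    """Return a synthetic (chosen, rejected) preference pair for one query.

    Samples a pool of n responses, scores each once, and keeps the
    highest-scoring response as "chosen" and the lowest-scoring as "rejected".
    """
    if n < 2:
        raise ValueError("need at least two candidates to form a pair")

    # Build the candidate pool for this query.
    candidates = [sample_response(query) for _ in range(n)]

    # Score every candidate exactly once, then pick the two extremes.
    scores = [reward(query, response) for response in candidates]
    best = max(range(n), key=lambda i: scores[i])
    worst = min(range(n), key=lambda i: scores[i])
    return candidates[best], candidates[worst]
```

Running this over a set of unlabeled queries would produce pairs that can be appended to the human preference data before retraining the reward model. Any further filtering of the pairs is left out of this sketch.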

