CL · AI · LG · Jun 12, 2024

It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

arXiv:2406.07971v2 · 5 citations
AI Analysis

This work addresses a critical bottleneck in aligning language models with human preferences, offering incremental improvements for RLHF practitioners.

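The paper examines the mismatch between reward models (RMs) and policy models (PMs) in RLHF: RMs fail to assign proper scores to PM responses, which results in a 35% mismatch rate with human preferences. It proposes SEAM, an automatic metric that measures this seamlessness without human effort. Using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and SEAM-guided model augmentation yields a 4% improvement over standard augmentation methods.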

Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of seamlessness. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.
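
To make the "mismatch rate with human preferences" concrete, here is a minimal sketch of how such a rate could be computed over pairs of PM responses. It is not the paper's SEAM metric, whose definition is not given in the abstract. The function name `mismatch_rate`, the pairwise data layout, and the tie-handling rule are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def mismatch_rate(
    rm_scores: Sequence[Tuple[float, float]],
    human_prefers_first: Sequence[bool],
) -> float:
    """Fraction of response pairs where the RM ranking disagrees with the human label.

    Hypothetical helper: rm_scores[i] holds the RM scores for responses (A, B)
    of pair i; human_prefers_first[i] is True when annotators preferred A.
    """
    if len(rm_scores) != len(human_prefers_first):
        raise ValueError("rm_scores and human_prefers_first must have equal length")
    if not rm_scores:
        raise ValueError("at least one response pair is required")

    mismatches = 0
    for (score_a, score_b), prefers_a in zip(rm_scores, human_prefers_first):
        # Assumption: a tie counts as the RM preferring B.
        rm_prefers_a = score_a > score_b
        if rm_prefers_a != prefers_a:
            mismatches += 1
    return mismatches / len(rm_scores)


# Toy data, invented for illustration only.
example_scores = [(0.9, 0.2), (0.1, 0.7), (0.4, 0.6), (0.8, 0.3)]
example_labels = [True, True, False, False]
example_rate = mismatch_rate(example_scores, example_labels)
```

On this toy data the RM disagrees with the human label on two of the four pairs. A rate like the 35% reported above would correspond to roughly one disagreement in every three pairs.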

Code Implementations: 1 repo
