LG | AI | CL | Oct 18, 2024

How to Evaluate Reward Models for RLHF

Berkeley
arXiv:2410.14872v2 | 75 citations | h-index: 24 | Has Code | ICLR
Originality: Incremental advance
AI Analysis

This work addresses the cost of evaluating reward models for RLHF, a practical concern for researchers and practitioners in AI alignment. It is an incremental advance, since it builds on existing proxy evaluation methods.

The authors tackled the problem of evaluating reward models for RLHF by creating a benchmark that predicts downstream LLM performance through proxy tasks, avoiding the high cost of full RLHF training. They compiled this into the Preference Proxy Evaluations (PPE) benchmark, which is open-sourced for public use.

We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback). The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance. However, this process is prohibitively expensive. To address this, we build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks. These proxy tasks consist of a large-scale human preference dataset and a verifiable correctness preference dataset, on which we measure 12 metrics across 12 domains. To investigate which reward model metrics are most correlated with gold-standard RLHF outcomes, we launch an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform and treat the real downstream performance of each reward model as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance, which we open-source for public use and further development. Our code and evaluations can be found at https://github.com/lmarena/PPE.
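The idea of scoring a reward model on a proxy task and then checking how well that score tracks post-RLHF performance can be sketched in a few lines. This is only an illustration under stated assumptions, not the PPE implementation: pairwise accuracy stands in for one possible proxy metric (the abstract does not list the 12 metrics), Pearson correlation stands in for the correlation analysis, and the function names and all numbers are hypothetical toy values.

```python
from math import sqrt


def pairwise_accuracy(pairs):
    """Fraction of preference pairs where the reward model scores the
    preferred response strictly higher than the rejected one.

    pairs: list of (reward_for_preferred, reward_for_rejected) tuples.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: rewards three reward models assign to the same
# four preference pairs, as (preferred, rejected).
reward_model_pairs = {
    "rm_a": [(1.0, 0.0), (0.2, 0.5), (2.0, 1.0), (0.3, 0.1)],
    "rm_b": [(0.9, 0.1), (0.8, 0.5), (2.0, 1.0), (0.3, 0.1)],
    "rm_c": [(0.1, 0.4), (0.2, 0.5), (2.0, 1.0), (0.0, 0.1)],
}

# Hypothetical downstream scores of the LLMs trained with each reward model.
downstream_score = {"rm_a": 0.55, "rm_b": 0.62, "rm_c": 0.41}

names = sorted(reward_model_pairs)
proxy = [pairwise_accuracy(reward_model_pairs[name]) for name in names]
downstream = [downstream_score[name] for name in names]

# A proxy metric is useful to the extent that it tracks downstream outcomes.
print("proxy accuracies:", dict(zip(names, proxy)))
print("correlation with downstream:", pearson(proxy, downstream))
```

With only three reward models the correlation is not statistically meaningful; the sketch shows the shape of the analysis. A rank correlation such as Spearman's would be a reasonable alternative when only the ordering of reward models matters.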

Code Implementations: 1 repo