AI · Apr 21, 2025

Establishing Reliability Metrics for Reward Models in Large Language Models

arXiv:2504.14838v1 · 1 citation · h-index: 8
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of ensuring that reward models align with human preferences in AI systems. It is incremental in that it provides a new evaluation method rather than a fundamental breakthrough.

The paper tackles the problem of uncertain reliability in reward models (RMs) for large language models by proposing the RETA metric to directly measure RM reliability, demonstrating its superior stability in experiments.

The reward model (RM) that represents human preferences plays a crucial role in optimizing the outputs of large language models (LLMs), e.g., through reinforcement learning from human feedback (RLHF) or rejection sampling. However, a long-standing challenge for RMs is their uncertain reliability, i.e., LLM outputs with higher rewards may not align with actual human preferences. Currently, there is no convincing metric to quantify the reliability of RMs. To bridge this gap, we propose the Reliable at $\eta$ (RETA) metric, which directly measures the reliability of an RM by evaluating the average quality (scored by an oracle) of the top $\eta$ quantile of responses as ranked by the RM. On top of RETA, we present an integrated benchmarking pipeline that allows anyone to evaluate their own RM without incurring additional oracle labeling costs. Extensive experimental studies demonstrate the superior stability of the RETA metric, providing solid evaluations of the reliability of various publicly available and proprietary RMs. When dealing with an unreliable RM, we can use the RETA metric to identify the optimal quantile from which to select responses.
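As a rough illustration of the idea described in the abstract, the sketch below computes the mean oracle score over the top $\eta$ fraction of responses ranked by RM score, then scans a grid of $\eta$ values for the best one. This is a minimal reading of the abstract, not the paper's exact definition: the function names `reta` and `best_eta`, the pooling of all responses into a single ranking, and the rounding of the quantile size are assumptions made here for illustration.

```python
import numpy as np

def reta(rm_scores, oracle_scores, eta):
    """Mean oracle score of the top-eta fraction of responses ranked by the RM.

    Assumption: all responses are pooled into one ranking; the paper may
    instead aggregate per prompt.
    """
    rm = np.asarray(rm_scores, dtype=float)
    oracle = np.asarray(oracle_scores, dtype=float)
    if rm.shape != oracle.shape or rm.ndim != 1 or rm.size == 0:
        raise ValueError("scores must be non-empty 1-D arrays of equal length")
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    # Number of responses kept; at least one (rounding rule is an assumption).
    k = max(1, int(np.ceil(eta * rm.size)))
    # Indices of the k responses with the highest RM scores.
    top = np.argsort(-rm, kind="stable")[:k]
    return float(oracle[top].mean())

def best_eta(rm_scores, oracle_scores, etas):
    """Return the eta from `etas` with the highest RETA value (hypothetical helper)."""
    return max(etas, key=lambda e: reta(rm_scores, oracle_scores, e))

# Toy data: the RM's favourite response is not the oracle's favourite.
rm_scores = [0.9, 0.1, 0.5, 0.7]
oracle_scores = [1.0, 0.0, 4.0, 2.0]
print(reta(rm_scores, oracle_scores, 0.5))
print(best_eta(rm_scores, oracle_scores, [0.25, 0.5, 0.75, 1.0]))
```

In this toy case the RM ranks a mediocre response first, so widening the selection quantile raises the average oracle quality, which is the situation the abstract's last sentence refers to.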

