LG | Mar 28, 2025

Probabilistic Uncertain Reward Model

arXiv:2503.22480v6 | 4 citations | h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses reward hacking in RLHF for training large language models, representing an incremental improvement by generalizing the Bradley-Terry model to incorporate uncertainty.

The paper tackles the problem of overconfidence in conventional reward models for reinforcement learning from human feedback (RLHF), which leads to reward hacking and degraded performance. It proposes the Probabilistic Uncertain Reward Model (PURM), which outperforms existing methods with more accurate reward and uncertainty estimations, sustaining effective learning for more optimization steps and achieving higher maximum win rates.

Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples. This leads to reward hacking, where the policy model blindly optimizes for proxy rewards while true performance degrades. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerge from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate reward estimates and sound uncertainty estimates, sustains effective learning for more optimization steps, and achieves a higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
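
The abstract does not state PURM's loss function or its exact overlap measure, so the following is only an illustrative sketch of the general idea, not the paper's implementation. It assumes each response's reward is an independent Gaussian, takes the preference probability as the expected Bradley-Terry sigmoid under those Gaussians, and uses the Bhattacharyya coefficient as one possible overlap measure. All function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard Bradley-Terry negative log-likelihood for point rewards."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def expected_preference(mu_c: float, sigma_c: float,
                        mu_r: float, sigma_r: float,
                        n: int = 2001) -> float:
    """E[sigmoid(d)] for d ~ N(mu_c - mu_r, sigma_c^2 + sigma_r^2).

    Assumption: the two rewards are independent Gaussians, so their
    difference is Gaussian too. The expectation is approximated with the
    trapezoid rule over +/- 8 standard deviations.
    """
    mean = mu_c - mu_r
    std = math.sqrt(sigma_c ** 2 + sigma_r ** 2)
    if std == 0.0:
        # No uncertainty: this reduces to the ordinary Bradley-Terry probability.
        return sigmoid(mean)
    lo = mean - 8.0 * std
    h = 16.0 * std / (n - 1)
    total = 0.0
    for i in range(n):
        d = lo + i * h
        pdf = math.exp(-0.5 * ((d - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * sigmoid(d) * pdf * h
    return total


def uncertain_loss(mu_c: float, sigma_c: float,
                   mu_r: float, sigma_r: float) -> float:
    """Negative log of the expected preference probability (illustrative only)."""
    return -math.log(expected_preference(mu_c, sigma_c, mu_r, sigma_r))


def bhattacharyya_coefficient(mu1: float, sigma1: float,
                              mu2: float, sigma2: float) -> float:
    """Overlap in (0, 1] between two univariate Gaussians; needs sigma > 0.

    Assumption: this is one common overlap measure, chosen for illustration.
    A value of 1 means identical distributions; values near 0 mean the
    distributions are well separated.
    """
    var_sum = sigma1 ** 2 + sigma2 ** 2
    scale = math.sqrt(2.0 * sigma1 * sigma2 / var_sum)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * var_sum))
```

Under these assumptions, a pair whose reward distributions overlap heavily, for example `bhattacharyya_coefficient(0.0, 1.0, 0.5, 1.0)`, yields a coefficient close to 1 and can be treated as an uncertain comparison, while a well-separated pair such as `bhattacharyya_coefficient(0.0, 1.0, 5.0, 1.0)` yields a value near 0. With both standard deviations set to zero, `uncertain_loss` coincides with `bt_loss`, which is the sense in which a distributional model generalizes the point-estimate one.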

