RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
This work addresses the need to better understand and compare uncertainty-aware reward models, which can reduce human annotation costs and mitigate reward overoptimization in LLM alignment. Its contribution is incremental, since it focuses on evaluation rather than proposing a new method.
The authors address epistemic uncertainty in reward models used to align large language models with human preferences. They introduce RewardUQ, a unified framework for systematic evaluation, and find that model size and initialization have the largest impact on performance and that most prior work could have benefited from alternative design choices.
Reward models are central to aligning large language models (LLMs) with human preferences. Yet most approaches rely on pointwise reward estimates that overlook the epistemic uncertainty in reward models arising from limited human feedback. Recent work suggests that quantifying this uncertainty can reduce the costs of human annotation via uncertainty-guided active learning and mitigate reward overoptimization in LLM post-training. However, uncertainty-aware reward models have so far been adopted without thorough comparison, leaving them poorly understood. This work introduces a unified framework, RewardUQ, to systematically evaluate uncertainty quantification for reward models. We compare common methods along standard metrics measuring accuracy and calibration, and we propose a new ranking strategy incorporating both dimensions for a simplified comparison. Our experimental results suggest that model size and initialization have the most meaningful impact on performance, and most prior work could have benefited from alternative design choices. To foster the development and evaluation of new methods and aid the deployment in downstream applications, we release our open-source framework as a Python package. Our code is available at https://github.com/lasgroup/rewarduq.
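To make the two evaluation dimensions named above concrete, the following is a minimal sketch of how accuracy and calibration might be computed for an ensemble-based reward model on preference pairs. It is not the RewardUQ API: the function names, the Bradley-Terry (sigmoid) preference link, the use of ensemble standard deviation as an epistemic uncertainty proxy, the binned expected calibration error, and the toy reward values are all assumptions chosen for illustration. The ranking strategy proposed in the paper is not reproduced here.

```python
import math
from statistics import mean, pstdev


def sigmoid(x: float) -> float:
    """Logistic function, assumed here as the Bradley-Terry preference link."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble_preference(chosen_rewards, rejected_rewards):
    """Return (mean probability that the chosen response wins, ensemble spread).

    Each argument holds one reward per ensemble member for the same response.
    The population standard deviation across members is a simple stand-in
    for epistemic uncertainty (an assumption of this sketch).
    """
    probs = [sigmoid(c - r) for c, r in zip(chosen_rewards, rejected_rewards)]
    return mean(probs), pstdev(probs)


def accuracy(probs):
    """Fraction of pairs where the model prefers the human-chosen response.

    A probability of exactly 0.5 counts as incorrect.
    """
    return sum(p > 0.5 for p in probs) / len(probs)


def expected_calibration_error(probs, n_bins=10):
    """Binned ECE over the confidence assigned to the predicted winner."""
    bins = [[] for _ in range(n_bins)]
    for p in probs:
        conf = max(p, 1.0 - p)  # confidence in the predicted winner, in [0.5, 1]
        correct = 1.0 if p > 0.5 else 0.0
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = mean(c for c, _ in b)
            avg_acc = mean(a for _, a in b)
            ece += len(b) / len(probs) * abs(avg_conf - avg_acc)
    return ece


# Toy data (hypothetical): three preference pairs, three ensemble members each.
pairs = [
    ([1.2, 0.9, 1.1], [0.1, 0.3, 0.2]),
    ([0.2, -0.1, 0.4], [0.5, 0.6, 0.3]),
    ([2.0, 1.5, 1.8], [-0.5, 0.0, -0.2]),
]
results = [ensemble_preference(c, r) for c, r in pairs]
probs = [p for p, _ in results]
uncertainties = [u for _, u in results]

print("accuracy:", accuracy(probs))
print("ECE:", expected_calibration_error(probs))
print("per-pair ensemble spread:", uncertainties)
```

In an uncertainty-guided active learning setting, the per-pair spread could be used to select which pairs to send for human annotation, while accuracy and ECE summarize how well the averaged predictions match the observed preferences.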