CL | AI | CY | LG | Jun 8, 2025

Reward Model Interpretability via Optimal and Pessimal Tokens

arXiv:2506.07326v1 | 13 citations | h-index: 8 | Has Code | FAccT
Originality: Incremental advance
AI Analysis

This paper addresses the understudied problem of reward model interpretability for AI alignment and reveals concerning biases that could propagate through deployed large language models. Its contribution is incremental in that it focuses on analysis rather than new methods.

The paper tackles reward model interpretability by analyzing how ten open-source reward models score single-token responses to value-laden prompts. It finds substantial heterogeneity between models, systematic asymmetries between high- and low-scoring tokens, sensitivity to prompt framing, and overvaluation of frequent tokens.

Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
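The exhaustive-scoring idea in the abstract can be illustrated with a minimal sketch, assuming a reward model can be wrapped as a function from a (prompt, response) pair to a scalar. Everything named here is hypothetical: the six-word vocabulary, the two lookup-table "models", and the helper functions are stand-ins for illustration, not the paper's models or code. The Jaccard overlap of top-k tokens is one simple way to quantify disagreement between models and is not necessarily the measure the authors use.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a reward model is any callable mapping (prompt, response) -> scalar.
RewardFn = Callable[[str, str], float]


def score_vocabulary(prompt: str, vocab: List[str], reward_fn: RewardFn) -> Dict[str, float]:
    """Score every single-token response in the vocabulary for one prompt."""
    return {token: reward_fn(prompt, token) for token in vocab}


def optimal_and_pessimal(scores: Dict[str, float], k: int = 3) -> Tuple[List[str], List[str]]:
    """Return the k highest-scoring tokens and the k lowest-scoring tokens (worst first)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:][::-1]


def top_k_overlap(scores_a: Dict[str, float], scores_b: Dict[str, float], k: int) -> float:
    """Jaccard overlap between the top-k tokens of two models (1.0 means identical sets)."""
    top_a = set(optimal_and_pessimal(scores_a, k)[0])
    top_b = set(optimal_and_pessimal(scores_b, k)[0])
    return len(top_a & top_b) / len(top_a | top_b)


# Hypothetical toy vocabulary and toy "reward models" (fixed lookup tables).
TOY_VOCAB = ["love", "joy", "the", "okay", "pain", "hate"]


def toy_model_a(prompt: str, token: str) -> float:
    table = {"love": 2.0, "joy": 1.5, "the": 0.5, "okay": 0.0, "pain": -1.5, "hate": -2.0}
    return table[token]


def toy_model_b(prompt: str, token: str) -> float:
    # This toy model ranks the frequent word "the" unusually high.
    table = {"joy": 1.8, "the": 1.2, "love": 1.0, "okay": 0.1, "hate": -0.5, "pain": -2.5}
    return table[token]


PROMPT = "What, in one word, is the greatest thing ever?"  # illustrative prompt
scores_a = score_vocabulary(PROMPT, TOY_VOCAB, toy_model_a)
scores_b = score_vocabulary(PROMPT, TOY_VOCAB, toy_model_b)

best_a, worst_a = optimal_and_pessimal(scores_a, k=2)
best_b, worst_b = optimal_and_pessimal(scores_b, k=2)
overlap = top_k_overlap(scores_a, scores_b, k=2)
```

With a real reward model, `reward_fn` would wrap a forward pass over the prompt and a one-token response, and `vocab` would be the tokenizer's full vocabulary. The ranking and comparison steps would stay the same.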

