LG CL Feb 9

Bayesian Preference Learning for Test-Time Steerable Reward Models

arXiv:2602.08819v1
h-index: 8
Originality: Highly original
AI Analysis

This work addresses the need for reward models that can adapt at test time as reinforcement learning moves into settings such as verifiable rewards and multi-objective alignment. It proposes a new method for a known bottleneck: classifier reward models stay static once trained.

The paper tackles static reward models (RMs) that cannot adapt at test time. It proposes Variational In-Context Reward Modeling (ICRM), a Bayesian objective that makes the RM steerable via in-context preference demonstrations. With more demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% on RM-Bench in the single-objective setting.

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time in both single- and multi-objective settings. With more in-context demonstrations, ICRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting, while widening the Pareto frontier with a 4% gain in hypervolume on helpfulness and refusal benchmarks. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
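The two ingredients the abstract names, a Bradley-Terry preference probability and a conjugate Beta prior over it, can be illustrated with a textbook Beta-Bernoulli update. This is a minimal sketch and not the paper's method: ICRM amortizes the inference with a learned model, whereas the code below simply counts demonstration outcomes. All function names and the Beta(1, 1) prior are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(a preferred over b) under Bradley-Terry: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def beta_posterior(alpha: float, beta: float,
                   outcomes: Sequence[int]) -> Tuple[float, float]:
    """Conjugate update of a Beta(alpha, beta) prior on the preference
    probability, given binary outcomes (1 = response a was preferred)."""
    wins = sum(outcomes)
    losses = len(outcomes) - wins
    return alpha + wins, beta + losses


def beta_mean(alpha: float, beta: float) -> float:
    """Posterior mean of the preference probability."""
    return alpha / (alpha + beta)


def beta_variance(alpha: float, beta: float) -> float:
    """Posterior variance; it shrinks as more demonstrations are observed."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1.0))


if __name__ == "__main__":
    # Hypothetical demonstrations: a preferred in 3 of 4 comparisons.
    demos = [1, 1, 0, 1]
    a_post, b_post = beta_posterior(1.0, 1.0, demos)  # assumed uniform prior
    print("posterior:", (a_post, b_post))
    print("mean:", beta_mean(a_post, b_post))
    print("variance:", beta_variance(a_post, b_post))
    print("BT prob for a reward gap of 1.0:", bradley_terry_prob(1.0, 0.0))
```

In this toy version, each added demonstration shifts the posterior mean toward the observed preference and tightens the variance, which is one intuition for why more in-context demonstrations could help a reward model adapt at test time.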