AI · Oct 3, 2025

Reward Model Routing in Alignment

arXiv:2510.02850v1 · 2 citations · h-index: 1

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of improving alignment for large language models. The advance is incremental, since it builds on existing reward model routing methods.

The paper tackles the limited alignment quality and overfitting risk that arise when reinforcement learning from human or AI feedback relies on a single reward model. It proposes BayesianRouter, a hybrid routing framework that dynamically selects a reward model from a pool for each query, and reports consistent gains over individual RMs, RM ensembling, and existing routing methods on benchmarks such as AlpacaEval-2 and GSM8K.

Reinforcement learning from human or AI feedback (RLHF / RLAIF) has become the standard paradigm for aligning large language models (LLMs). However, most pipelines rely on a single reward model (RM), limiting alignment quality and risking overfitting. Recent work explores RM routing, which dynamically selects an RM from a candidate pool to exploit complementary strengths while maintaining $O(1)$ RM calls, but existing methods suffer from cold-start problems and insufficient exploration. We propose BayesianRouter, a hybrid routing framework that combines offline learning of RM strengths with online Bayesian selection. In the offline stage, a multi-task router is trained on preference data to estimate per-RM reliability. In the online stage, a Bayesian Thompson sampling router performs per-query RM selection, initializing RM-specific weight vectors with offline embeddings as Gaussian priors and updating their posteriors with online rewards to track the evolving policy distribution. Extensive experiments on instruction-following (AlpacaEval-2, Arena-Hard, MT-Bench) and reasoning (GSM8K, MMLU) benchmarks show that BayesianRouter consistently outperforms individual RMs, RM ensembling, and existing routing methods.
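
The online stage described above can be illustrated with a generic linear Thompson sampling router. This is a minimal sketch under stated assumptions, not the paper's implementation: the class name `GaussianThompsonRouter`, the parameters `prior_var` and `noise_var`, the isotropic prior covariance, the use of a fixed-size query embedding, and the scalar `reward` passed to `update` are all hypothetical choices made for illustration. The abstract does not specify how the online reward is computed or how the offline router produces its embeddings, so those are treated as given inputs here.

```python
import numpy as np


class GaussianThompsonRouter:
    """Illustrative linear Thompson sampling over a pool of reward models.

    Each RM k has a weight vector w_k with a Gaussian prior centred on a
    (hypothetical) offline embedding; posteriors follow standard Bayesian
    linear regression with known noise variance.
    """

    def __init__(self, prior_means, prior_var=1.0, noise_var=1.0, seed=0):
        prior_means = np.asarray(prior_means, dtype=float)
        self.num_rms, self.dim = prior_means.shape
        self.noise_var = noise_var
        self.rng = np.random.default_rng(seed)
        # Natural parameters per RM: precision matrix and precision-weighted mean.
        self.precision = np.stack(
            [np.eye(self.dim) / prior_var for _ in range(self.num_rms)]
        )
        self.shift = prior_means / prior_var

    def posterior(self, k):
        """Return the posterior mean and covariance of w_k."""
        cov = np.linalg.inv(self.precision[k])
        cov = (cov + cov.T) / 2.0  # remove tiny numerical asymmetry
        return cov @ self.shift[k], cov

    def select(self, query_emb):
        """Sample one weight vector per RM and pick the highest-scoring RM."""
        x = np.asarray(query_emb, dtype=float)
        scores = []
        for k in range(self.num_rms):
            mean, cov = self.posterior(k)
            w = self.rng.multivariate_normal(mean, cov)
            scores.append(w @ x)
        return int(np.argmax(scores))

    def update(self, k, query_emb, reward):
        """Fold one observed (query, reward) pair into RM k's posterior."""
        x = np.asarray(query_emb, dtype=float)
        self.precision[k] += np.outer(x, x) / self.noise_var
        self.shift[k] += reward * x / self.noise_var


# Usage: three RMs, four-dimensional query embeddings, zero prior means.
router = GaussianThompsonRouter(prior_means=np.zeros((3, 4)))
query = np.array([1.0, 0.0, 0.0, 0.0])
chosen = router.select(query)
router.update(chosen, query, reward=1.0)
```

Only the selected RM is scored and updated per query, which is what keeps the cost at one RM call. Starting each posterior at an offline estimate rather than at zero is the sketch's stand-in for the cold-start remedy the abstract describes.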
