GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning
This work addresses reward modeling's heavy reliance on labeled preference data by offering a foundation model for reward reasoning with broad applicability, though its use of self-training to elicit reasoning is incremental.
The paper tackles the challenge of developing effective reward models by proposing GRAM-R$^2$, a generative reward model that is self-trained on unlabeled data to produce preference labels together with reward rationales. In experiments on response ranking, task adaptation, and reinforcement learning from human feedback, it outperforms strong baselines.
Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs towards generalist reward models. Despite this trend, developing effective reward models remains fundamentally challenging because of the heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning and can be applied to a wide range of tasks with minimal or no additional fine-tuning, supporting downstream applications such as response ranking and task-specific reward tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several competitive discriminative and generative baselines.
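To make the response-ranking use case concrete, here is a minimal sketch of how a generative reward model that emits a rationale and a preference label for a pair of responses could be used to rank several candidates. It is an illustration under stated assumptions, not the paper's implementation: the judge signature, the "A"/"B" label convention, the win-count aggregation, and the names `rank_responses` and `toy_judge` are all hypothetical, and `toy_judge` is a stand-in that simply prefers the longer response.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Hypothetical judge interface: (prompt, response_a, response_b) -> (rationale, label),
# where label is "A" or "B". A real generative reward model would fill this role.
Judge = Callable[[str, str, str], Tuple[str, str]]


def rank_responses(prompt: str, responses: List[str], judge: Judge) -> List[str]:
    """Rank responses by the number of pairwise comparisons each one wins."""
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        _rationale, label = judge(prompt, responses[i], responses[j])
        if label == "A":
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so ties keep their original relative order.
    order = sorted(range(len(responses)), key=lambda k: wins[k], reverse=True)
    return [responses[k] for k in order]


def toy_judge(prompt: str, a: str, b: str) -> Tuple[str, str]:
    """Stand-in judge (assumption): prefers the longer response."""
    label = "A" if len(a) >= len(b) else "B"
    return (f"Response {label} is preferred because it is more detailed.", label)


if __name__ == "__main__":
    print(rank_responses("Explain RLHF.", ["short", "a much longer answer", "medium one"], toy_judge))
```

Counting pairwise wins is only one possible aggregation; the abstract does not specify how rankings are derived from pairwise preferences, so this choice should be read as a placeholder.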