ConsistRM: Improving Generative Reward Models via Consistency-Aware Self-Training
This work addresses scalability and stability issues in generative reward models for AI alignment, representing an incremental improvement in self-training methods.
The paper addresses two challenges in training generative reward models (GRMs) to align large language models with human preferences: reliance on costly human annotations and instability in self-training. It proposes ConsistRM, a self-training framework that outperforms vanilla Reinforcement Fine-Tuning by an average of 1.5% across five benchmark datasets and four base models.
Generative reward models (GRMs) have emerged as a promising approach for aligning Large Language Models (LLMs) with human preferences by offering greater representational capacity and flexibility than traditional scalar reward models. However, GRMs face two major challenges: reliance on costly human-annotated data restricts scalability, and self-training approaches often suffer from instability and vulnerability to reward hacking. To address these issues, we propose ConsistRM, a self-training framework that enables effective and stable GRM training without human annotations. ConsistRM incorporates the Consistency-Aware Answer Reward, which produces reliable pseudo-labels with temporal consistency, thereby providing more stable model optimization. Moreover, the Consistency-Aware Critique Reward is introduced to assess semantic consistency across multiple critiques and to allocate fine-grained, differentiated rewards. Experiments on five benchmark datasets across four base models demonstrate that ConsistRM outperforms vanilla Reinforcement Fine-Tuning (RFT) by an average of 1.5%. Further analysis shows that ConsistRM enhances output consistency and mitigates position bias caused by input order, highlighting the effectiveness of consistency-aware rewards in improving GRMs.
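To make the two reward ideas concrete, the following is a minimal illustrative sketch, not the paper's implementation. The abstract does not specify how pseudo-labels or critique consistency are computed, so everything here is an assumption: pseudo-labels are approximated by a simple majority vote over sampled preference judgments (which could be gathered across samples or across checkpoints), and semantic consistency between critiques is approximated by token-level Jaccard overlap as a crude stand-in. All function names are hypothetical.

```python
from collections import Counter


def majority_pseudo_label(votes):
    """Return (label, agreement ratio) for a list of sampled preference votes.

    Assumption: a majority vote stands in for the paper's pseudo-labeling.
    """
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    return label, count / len(votes)


def answer_reward(vote, pseudo_label):
    """Reward 1.0 when a sampled vote agrees with the pseudo-label, else 0.0."""
    return 1.0 if vote == pseudo_label else 0.0


def jaccard(a, b):
    """Token-level Jaccard similarity between two critique strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def critique_rewards(critiques):
    """Score each critique by its mean similarity to the other critiques.

    Assumption: lexical overlap is a rough proxy for semantic consistency.
    """
    n = len(critiques)
    if n < 2:
        return [0.0] * n
    return [
        sum(jaccard(critiques[i], critiques[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


if __name__ == "__main__":
    votes = ["A", "A", "B", "A"]
    label, agreement = majority_pseudo_label(votes)
    print(label, agreement)
    print([answer_reward(v, label) for v in votes])
    print(critique_rewards(["a b", "a b", "c d"]))
```

In this sketch, a critique that diverges from the rest receives a lower score than those that agree with each other, which is one way to obtain differentiated rewards without human labels. A real system would likely replace the Jaccard overlap with an embedding-based or model-based similarity measure.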