CL | Jan 28

CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria

arXiv:2601.20327v1 | 1 citation | h-index: 13 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the unreliability of automatic evaluation for open-ended natural language generation in reinforcement learning applications. It is an incremental improvement over existing LLM-as-a-Judge approaches.

The paper tackles the gap between benchmark performance and practical effectiveness of LLM-based reward models in reinforcement learning by proposing CE-RM-4B, a pointwise generative reward model trained with a two-stage rollout method and unified criteria. Using only about 5.7K curated data points, it achieves superior performance on reward model benchmarks and delivers more effective improvements in downstream RL practice.

Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to some limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model trained with a dedicated two-stage rollout method, and adopting unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
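One way to see why pointwise scoring fits Best-of-N selection is to compare call counts: a pointwise scorer needs one call per candidate, while a full round-robin of pairwise judgments grows quadratically. The sketch below is illustrative only and is not CE-RM-4B's interface; `best_of_n`, `pairwise_comparisons`, and `toy_score` are hypothetical names, and `toy_score` is a trivial word-overlap stand-in for a real reward model.

```python
from typing import Callable, Sequence


def best_of_n(query: str, candidates: Sequence[str],
              score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest pointwise score.

    Uses exactly len(candidates) scorer calls; ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: score(query, c))


def pairwise_comparisons(n: int) -> int:
    """Judge calls needed for a full round-robin over n candidates."""
    return n * (n - 1) // 2


def toy_score(query: str, response: str) -> float:
    # Hypothetical stand-in for a reward model (assumption, not from the paper):
    # counts how many response words also appear in the query.
    query_words = set(query.lower().split())
    return float(sum(w in query_words for w in response.lower().split()))


if __name__ == "__main__":
    query = "capital of France"
    candidates = ["I do not know", "The capital of France is Paris"]
    print(best_of_n(query, candidates, toy_score))
    print(pairwise_comparisons(8))
```

With 8 candidates, the pointwise route makes 8 scorer calls, whereas a full pairwise round-robin would make 28 judge calls. Real pairwise setups may use cheaper tournament schemes, so the quadratic figure is an upper-end comparison rather than a fixed cost.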
