CL · LG · Feb 2

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

arXiv:2602.01511v124 citations · h-index: 9
Originality: Highly original
AI Analysis

The paper addresses a limitation of scalar reward models: they fail to capture the multifaceted nature of response quality in tasks such as creative writing or open-ended instruction following. The analysis rates it as a novel method rather than an incremental improvement.

The paper tackles reward modeling for non-verifiable LLM tasks by proposing Rubric-ARM, a framework that jointly optimizes rubric generation and judgment with reinforcement learning. It reports state-of-the-art performance among baselines on multiple benchmarks and improved downstream policy alignment.

Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.
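The alternating schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the rubric generator is reduced to a softmax over three hypothetical rubrics, the judge to one logistic weight per rubric, and the preference data to a synthetic signal whose agreement with the label is set by the made-up `signal_acc` values. The phase order (judge first, then generator), the learning rates, and the running-mean baseline are likewise assumptions. In each phase one component is frozen while the other takes REINFORCE steps, with reward 1 when the judge's verdict matches the preference label.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rollout(rng, theta, w, signal_acc):
    """Sample a preference label, a rubric, and a judge verdict."""
    y = rng.randint(0, 1)                         # preference label (1 = first response preferred)
    probs = softmax(theta)                        # generator policy over rubrics
    k = rng.choices(range(len(theta)), weights=probs)[0]
    # Under rubric k the judge sees a +/-1 signal that agrees with the
    # label with probability signal_acc[k] (a toy assumption).
    agrees = rng.random() < signal_acc[k]
    s = (1 if y == 1 else -1) * (1 if agrees else -1)
    p = sigmoid(w[k] * s)                         # judge's probability of verdict 1
    v = 1 if rng.random() < p else 0              # sampled verdict
    r = 1.0 if v == y else 0.0                    # reward: verdict matches the label
    return k, s, p, v, r, probs


def train(signal_acc, rounds=3, steps=2000, lr_gen=0.05, lr_judge=0.5, seed=0):
    rng = random.Random(seed)
    n = len(signal_acc)
    theta = [0.0] * n                             # generator logits
    w = [0.0] * n                                 # judge weights, one per rubric
    baseline = 0.5                                # running-mean reward baseline
    for _ in range(rounds):
        # Alternate: update the judge with the generator frozen, then swap.
        for phase in ("judge", "generator"):
            for _ in range(steps):
                k, s, p, v, r, probs = rollout(rng, theta, w, signal_acc)
                adv = r - baseline
                if phase == "judge":
                    # REINFORCE: d/dw log P(v) = (v - p) * s
                    w[k] += lr_judge * adv * (v - p) * s
                else:
                    # REINFORCE: d/dtheta_j log softmax(theta)[k] = 1[j == k] - probs[j]
                    for j in range(n):
                        indicator = 1.0 if j == k else 0.0
                        theta[j] += lr_gen * adv * (indicator - probs[j])
                baseline += 0.05 * (r - baseline)
    return softmax(theta), w
```

Calling `train([0.5, 0.7, 0.9])` returns the generator's final rubric probabilities and the judge weights. In this toy setting the generator should shift its probability mass toward the most informative rubric as the judge learns to use it. The sketch only shows the structure of the schedule and makes no claim about the gradient-variance analysis in the paper.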

