LG AI CL, Jul 23, 2025

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, Sean Hendryx

arXiv:2507.17746v249.8219 citations, h-index: 11

Originality: Incremental advance

AI Analysis

This paper addresses the problem of applying reinforcement learning to complex, non-verifiable domains such as medical and scientific reasoning. The advance is incremental, since it builds on existing rubric-based evaluation methods.

The paper extends reinforcement learning to real-world reasoning tasks where evaluation relies on nuanced, multi-criteria judgments rather than binary correctness. It introduces Rubrics as Rewards (RaR), whose best variant achieves relative improvements of up to 31% on HealthBench and 7% on GPQA-Diamond over LLM-as-judge baselines that use direct Likert-based rewards.

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective for complex reasoning tasks with clear correctness signals such as math and coding. However, extending it to real-world reasoning tasks is challenging, as evaluation depends on nuanced, multi-criteria judgments rather than binary correctness. Instance-specific rubrics have recently been used in evaluation benchmarks to capture such judgments, but their potential as reward signals for on-policy post-training remains underexplored. We introduce $\textbf{Rubrics as Rewards}$ (RaR), an on-policy reinforcement learning method that extends RLVR beyond verifiable domains by using rubric-based feedback. Across both medical and science domains, we evaluate multiple strategies for aggregating rubric feedback into rewards. The best RaR variant achieves relative improvements of up to $31\%$ on HealthBench and $7\%$ on GPQA-Diamond over popular LLM-as-judge baselines that rely on direct Likert-based rewards. These results demonstrate that RaR-trained policies adapt well to diverse evaluation formats, performing strongly on both rubric-based and multiple-choice tasks. Moreover, we find that using rubrics as structured reward signals yields better alignment for smaller judges and reduces performance variance across judge scales.
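
The abstract mentions aggregating rubric feedback into a reward without spelling out a formula. As a rough illustration only, one plausible strategy is a weighted fraction of satisfied criteria. The sketch below assumes this form; the `RubricItem` class, the `aggregate_reward` function, the weights, and the example criteria are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One instance-specific criterion (hypothetical structure)."""
    description: str
    weight: float  # assumed importance weight, must be non-negative


def aggregate_reward(items: list[RubricItem], satisfied: list[bool]) -> float:
    """Return the weighted fraction of satisfied criteria, a value in [0, 1].

    This is an assumed aggregation rule for illustration, not the
    paper's confirmed method.
    """
    if len(items) != len(satisfied):
        raise ValueError("need exactly one judgment per rubric item")
    if any(item.weight < 0 for item in items):
        raise ValueError("rubric weights must be non-negative")
    total = sum(item.weight for item in items)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(item.weight for item, ok in zip(items, satisfied) if ok)
    return earned / total


# Hypothetical rubric for a single medical prompt.
rubric = [
    RubricItem("States the most likely diagnosis", 3.0),
    RubricItem("Recommends an appropriate follow-up", 2.0),
    RubricItem("Avoids unnecessary jargon", 1.0),
]

# In practice an LLM judge would produce these per-criterion judgments;
# they are hard-coded here for illustration.
reward = aggregate_reward(rubric, [True, False, True])
print(reward)
```

A scalar in [0, 1] like this could then be used as the per-response reward in an on-policy update, in place of a single Likert score from the judge.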
