Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
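This work introduces rubric-grounded RL, a structured reward framework in which a frozen LLM judge scores responses against weighted, task-specific criteria. The resulting policy improves on both held-out rubric evaluation and general reasoning benchmarks that were not derived from the training corpus.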
The authors propose rubric-grounded RL, where a frozen LLM judge scores each response on multiple task-specific criteria to provide a partial-credit optimization signal. Llama-3.1-8B-Instruct, trained with GRPO on rubrics derived from a corpus of roughly 100,000 scientific and technical documents, reaches 71.7% normalized reward on held-out rubric evaluation and improves over the base model on four reasoning benchmarks (GSM8K, MATH, GPQA Main, GPQA Diamond).
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of receiving a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-grounded reinforcement learning (RL)}: a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from a corpus of roughly 100,000 scientific and technical documents derived from the Office of Scientific and Technical Information (OSTI) and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). The GRPO-trained model achieves $71.7\%$ normalized reward on held-out rubric evaluation. It also improves over the base model on four reasoning benchmarks not derived from the training corpus: GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
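To make the partial-credit signal concrete, the following is a minimal sketch of how per-criterion judge scores might be combined into a normalized reward and then turned into group-relative advantages in the style of GRPO. It is an illustration under stated assumptions, not the implementation behind the reported results: the `Criterion` class, the criterion names, the weighted-average normalization, the use of the population standard deviation, and the `eps` constant are all hypothetical choices, and the judge itself is not modeled (its scores are supplied directly as numbers in [0, 1]).

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Criterion:
    name: str      # hypothetical criterion label
    weight: float  # relative importance; assumed positive


def normalized_reward(criteria, scores):
    """Weighted partial credit in [0, 1].

    `scores[name]` stands in for the frozen judge's score on one
    criterion, assumed to lie in [0, 1].
    """
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight * scores[c.name] for c in criteria)
    return earned / total


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of sampled responses.

    Assumption: population standard deviation plus a small eps, so a
    group with identical rewards yields zero advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric and judge scores for two sampled responses.
rubric = [
    Criterion("states_method", 2.0),
    Criterion("cites_result", 1.0),
    Criterion("units_correct", 1.0),
]
judge_scores = [
    {"states_method": 1.0, "cites_result": 0.0, "units_correct": 1.0},  # 3/4
    {"states_method": 0.0, "cites_result": 1.0, "units_correct": 0.0},  # 1/4
]

rewards = [normalized_reward(rubric, s) for s in judge_scores]
advantages = group_relative_advantages(rewards)
```

In this sketch the first response earns three quarters of the available weight and the second earns one quarter, so neither is treated as a flat pass or fail. The response that satisfies more of the rubric receives a positive advantage within its group and the other a negative one.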