LG · AI · May 7, 2025

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

arXiv:2505.04842v1 · 15 citations · h-index: 35
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in RL methods for LLMs, enabling better performance and efficiency on reasoning tasks, though it is incremental in that it builds on existing RL techniques.

The paper addresses weak test-time compute scaling in RL-fine-tuned LLM reasoners by proposing RL^V, which trains a single LLM as both a reasoner and a generative verifier on RL-generated data. It reports an accuracy gain of over 20% on MATH with parallel sampling and 8-32x more efficient test-time compute scaling than the base RL method.

Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V that augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long reasoning R1 model.
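The abstract does not say how verifier scores are combined with parallel samples. As a minimal sketch, assume each sampled solution yields a final answer and a verifier score in [0, 1] (for example, the model's own estimate that the solution is correct). The Python below compares plain majority voting with two common verifier-based selection rules, Best-of-N and score-weighted voting. The function names and sample data are hypothetical illustrations and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each sample is (final_answer, verifier_score). The scores are assumed to
# come from the same LLM acting as a verifier; here they are hard-coded.
Sample = Tuple[str, float]


def majority_vote(samples: List[Sample]) -> str:
    """Pick the most frequent answer, ignoring verifier scores."""
    counts: Dict[str, int] = defaultdict(int)
    for answer, _ in samples:
        counts[answer] += 1
    return max(counts, key=counts.get)


def best_of_n(samples: List[Sample]) -> str:
    """Pick the answer of the single highest-scored sample."""
    return max(samples, key=lambda s: s[1])[0]


def weighted_vote(samples: List[Sample]) -> str:
    """Pick the answer with the largest summed verifier score."""
    totals: Dict[str, float] = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical scores: the wrong answer "41" is more frequent,
# but the verifier rates the "42" solutions much higher.
samples = [("41", 0.2), ("41", 0.3), ("41", 0.1), ("42", 0.9), ("42", 0.8)]

print(majority_vote(samples))  # 41
print(best_of_n(samples))      # 42
print(weighted_vote(samples))  # 42
```

In this toy case the verifier overturns a wrong majority. This is the kind of benefit a value-free RL method cannot offer at test time without a separately trained verifier.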

