Self-Trained Verification for Training- and Test-Time Self-Improvement

arXiv:2605.3029098.2

AI Analysis

This work addresses the bottleneck of verification in self-improving reasoning models, enabling substantial gains on hard problems for both test-time and training-time self-improvement.

Self-trained verification (STV) improves test-time verification-refinement (V-R) loops: it roughly doubles accuracy on hard math and raises accuracy on scientific reasoning tasks 14x (1.5% to 21%). At training time, verifier-in-the-loop training (ViL), starting from an RL-converged generator, yields a further 33% gain in pass@1. The generator's standalone pass@1, with no verifier at test time, also improves by a relative 30% beyond where standard RL had converged.

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with the STV verifier's feedback inside the V-R loop, a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification.
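
To make the test-time verification-refinement (V-R) loop concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function `verify_refine`, its parameters `max_rounds` and `accept_threshold`, and the toy `toy_generate`, `toy_verify`, and `toy_refine` callables are all hypothetical names introduced for illustration. The abstract does not specify the stopping rule or the feedback format, so both are assumptions here.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   generate(problem) -> solution
#   verify(problem, solution) -> (score in [0, 1], textual feedback)
#   refine(problem, solution, feedback) -> revised solution


def verify_refine(
    problem: int,
    generate: Callable[[int], int],
    verify: Callable[[int, int], Tuple[float, str]],
    refine: Callable[[int, int, str], int],
    max_rounds: int = 4,
    accept_threshold: float = 0.9,
) -> int:
    """Run a simple V-R loop: draft, then alternate verify and refine."""
    solution = generate(problem)
    for _ in range(max_rounds):
        score, feedback = verify(problem, solution)
        if score >= accept_threshold:
            break  # verifier accepts the current solution (assumed stop rule)
        solution = refine(problem, solution, feedback)
    return solution


# Toy stand-ins so the sketch runs: the "problem" is a target integer.
def toy_generate(problem: int) -> int:
    return 0  # always start from a wrong draft


def toy_verify(problem: int, solution: int) -> Tuple[float, str]:
    if solution == problem:
        return 1.0, "correct"
    feedback = "too low" if solution < problem else "too high"
    return 0.0, feedback


def toy_refine(problem: int, solution: int, feedback: str) -> int:
    # Move one step in the direction the feedback indicates.
    return solution + 1 if feedback == "too low" else solution - 1


if __name__ == "__main__":
    print(verify_refine(3, toy_generate, toy_verify, toy_refine, max_rounds=5))
```

The toy verifier gives specific, actionable feedback, so the loop converges. A verifier that returned an inflated score or generic feedback would either stop the loop early on a wrong answer or give `refine` nothing to act on, which is the stall the abstract describes.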
