An Imperfect Verifier is Good Enough: Learning with Noisy Rewards
This work addresses the problem of noisy reward signals in RLVR for LLM practitioners. It shows that imperfect verification is not a fundamental barrier, though the contribution is incremental in nature.
The paper investigates how robust Reinforcement Learning with Verifiable Rewards (RLVR) is when the verifiers used to train Large Language Models are noisy. It finds that noise rates up to 15% yield peak validation accuracy within 2 percentage points of clean baselines, across several model families and in domains such as code generation and scientific reasoning.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent method for post-training Large Language Models (LLMs). However, verifiers are rarely error-free; even deterministic checks can be inaccurate, and the growing dependence on model-based judges exacerbates the issue. The extent to which RLVR is robust to such noise and the verifier accuracy required for effective training remain unresolved questions. We investigate these questions in the domains of code generation and scientific reasoning by introducing noise into RL training. Noise rates up to 15% yield peak validation accuracy within 2 percentage points of the clean baseline. These findings are consistent across controlled and model-based noise types, three model families (Qwen3, GLM4, Llama 3.1), and model sizes from 4B to 9B. Overall, the results indicate that imperfect verification does not constitute a fundamental barrier to RLVR. Furthermore, our findings suggest that practitioners should prioritize moderate accuracy with high precision over perfect verification.
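To make the idea of a noise rate concrete, the sketch below corrupts binary verifier rewards and then measures the accuracy and precision of the resulting noisy verifier. It is an illustration only, not the paper's procedure: the function names, the independent per-sample flips, the separate false-positive and false-negative rates, and the simulated 50/50 ground-truth labels are all assumptions introduced here.

```python
import random


def noisy_reward(true_reward: int, fp_rate: float, fn_rate: float,
                 rng: random.Random) -> int:
    """Return a possibly corrupted binary reward (hypothetical helper).

    A correct sample (1) is rejected with probability fn_rate;
    an incorrect sample (0) is accepted with probability fp_rate.
    """
    if true_reward == 1:
        return 0 if rng.random() < fn_rate else 1
    return 1 if rng.random() < fp_rate else 0


def accuracy_and_precision(true_rewards, observed_rewards):
    """Accuracy and precision of the noisy verifier against ground truth."""
    pairs = list(zip(true_rewards, observed_rewards))
    accuracy = sum(t == o for t, o in pairs) / len(pairs)
    # Precision: among samples the verifier accepted, how many were correct.
    accepted = [t for t, o in pairs if o == 1]
    precision = sum(accepted) / len(accepted) if accepted else float("nan")
    return accuracy, precision


# Assumed setup: simulated ground truth and a symmetric 15% noise rate.
rng = random.Random(0)
true_rewards = [rng.randint(0, 1) for _ in range(10_000)]
observed = [noisy_reward(t, fp_rate=0.15, fn_rate=0.15, rng=rng)
            for t in true_rewards]
accuracy, precision = accuracy_and_precision(true_rewards, observed)
print(f"accuracy={accuracy:.3f} precision={precision:.3f}")
```

Keeping the two error rates separate lets the same helper model a verifier that rarely accepts wrong answers (low `fp_rate`, hence high precision) while still rejecting some correct ones, which is the regime the abstract's closing recommendation points to.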