SE · AI · Mar 17

Aletheia: What Makes RLVR For Code Verifiers Tick?

arXiv:2601.12186 · 93.12 citations · h-index: 16
Predicted impact: top 6% in SE · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses the adoption lag of RLVR in code generation by providing practical, size-dependent strategies for practitioners to train verifiers more efficiently, though it is incremental as it builds on existing RLVR methods.

The paper addresses the high cost of training code verifiers with Reinforcement Learning with Verifiable Rewards (RLVR). It ablates three drivers of performance and cost: intermediate thinking traces, learning from negative samples, and on-policy training. The optimal recipe depends on model size: on-policy learning matters most for small verifiers, while thinking traces matter most for larger ones. The result is a compute-optimal roadmap that simplifies verifier training.

Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary drivers of RLVR performance and cost: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifiers across disparate model sizes and covariate shifts. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas thinking traces become the most vital factor for larger sizes. Furthermore, we show that negative samples stabilize training at large sizes, and scaling inference-time compute cannot compensate for any core RLVR component. These findings provide a compute-optimal roadmap for practitioners, offering concrete strategies to simplify verifier training based on model size. Consequently, our work establishes a foundation for scalable supervision, enabling efficiently trained code verifiers to reliably supervise much larger code generation policies.

