LG · AI · Jun 2

Trading Human Curation for Synthetic Augmentation in RLVR

Akshansh, Leonardo Rosa Rodrigues, Michael Korostelev, Youssef Hassan, Mark E. Whiting

arXiv:2606.03800 · 16.1

Predicted impact: top 28% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For practitioners training agentic language models with RLVR, this work provides a cost-effective alternative to expensive human task curation, though the substitution rate varies widely.

The paper investigates whether synthetic augmentations of a small hand-authored task set can substitute for additional human curation in reinforcement learning from verifiable rewards (RLVR) for agentic language models. It finds that gated synthetic tasks can replace human-authored ones at a cost-adjusted trade rate between 1.4× and 11.6× while retaining aggregate held-out generalization on a ten-benchmark suite.

The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $\rho_{\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate $\rho_{\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\times, 11.6\times]$ across the plausible $c_{\text{human}}/c_{\text{aug}}$ range.
