AI · LG · Apr 20

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma

arXiv:2604.18381 · 90.8 · h-index: 15 · Has Code

Predicted impact: top 18% in AI · last 90 days · Originality: Synthesis-oriented

AI Analysis

For practitioners with limited data or compute, the paper offers actionable insights on efficient RLVR fine-tuning, though its findings are incremental and domain-specific.

This paper studies RLVR fine-tuning of small language models in low data and compute regimes. It reports that training on mixed-complexity datasets yields up to 5x the sample efficiency of training on easy tasks, and that models trained on lower-complexity tasks can generalize to higher-complexity ones.

Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
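As a rough illustration of what a procedural dataset with controllable size and complexity, paired with a verifiable reward, could look like, here is a minimal Python sketch. It is not the authors' generator or reward. The counting task, the use of list length as a proxy for complexity, and every function name (`make_counting_example`, `verifiable_reward`, `make_mixed_dataset`) are assumptions made for illustration only.

```python
import random
import re


def make_counting_example(complexity, rng):
    """Generate one counting question (hypothetical task).

    Assumption: `complexity` simply scales the length of the digit list.
    """
    digits = [rng.randint(0, 9) for _ in range(5 * complexity)]
    target = rng.randint(0, 9)
    return {
        "question": f"How many times does {target} appear in {digits}?",
        "answer": digits.count(target),  # ground truth computed by the generator
        "complexity": complexity,
    }


def verifiable_reward(model_output, example):
    """Binary reward: 1.0 if the last integer in the output matches the ground truth."""
    numbers = re.findall(r"-?\d+", model_output)
    return 1.0 if numbers and int(numbers[-1]) == example["answer"] else 0.0


def make_mixed_dataset(size, complexities, seed=0):
    """Build `size` examples, cycling through the given complexity levels."""
    rng = random.Random(seed)
    return [
        make_counting_example(complexities[i % len(complexities)], rng)
        for i in range(size)
    ]


if __name__ == "__main__":
    dataset = make_mixed_dataset(size=6, complexities=[1, 2, 3])
    for example in dataset:
        print(example["complexity"], example["question"])
```

Because the generator computes the ground truth itself, size and complexity mix can be varied independently, for instance `make_mixed_dataset(100, [1])` for an easy-only set versus `make_mixed_dataset(100, [1, 2, 3])` for a mixed one.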
