LG, AI · Nov 7, 2025

You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models

Shuvendu Roy, Hossein Hajimirsadeghi, Mengyao Zhai, Golnoosh Samei

arXiv:2511.04902v1 · 4.1 · h-index: 10 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the problem of scaling unsupervised RL for reasoning to resource-constrained models, though it appears incremental because it builds on existing label-free RL approaches.

The paper investigates the limitations of label-free reinforcement learning (RL) methods for enhancing reasoning in smaller language models (0.5B to 7B parameters), finding that performance often degrades below baseline levels for weaker models due to insufficient chain-of-thought reasoning. It proposes a curriculum learning method with data curation, showing consistent improvements across all model sizes.

Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective method for label-free RL that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa
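
The two training-side ideas named in the abstract, easy-to-hard ordering and masking of no-majority rollouts, can be illustrated with a minimal sketch. It assumes a majority-vote pseudo-reward over the sampled answers for each prompt, which the abstract implies but does not spell out. The function names, the `difficulty` field, and the 0.5 agreement threshold are hypothetical and are not taken from the CuMa repository.

```python
from collections import Counter


def majority_vote_rewards(answers, min_agreement=0.5):
    """Pseudo-rewards and a loss mask for one prompt's rollouts.

    `answers` holds the final answer extracted from each rollout.
    `min_agreement` is an assumed threshold: the top answer must be
    given by strictly more than this fraction of the rollouts.
    """
    if not answers:
        return [], []
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count / len(answers) <= min_agreement:
        # No majority: the pseudo-label is unreliable, so every rollout
        # for this prompt is masked out of the policy update.
        return [0.0] * len(answers), [False] * len(answers)
    # Majority found: reward rollouts that agree with the pseudo-label.
    rewards = [1.0 if a == top_answer else 0.0 for a in answers]
    return rewards, [True] * len(answers)


def curriculum_order(problems, difficulty_key="difficulty"):
    """Sort problems from easiest to hardest (assumed numeric difficulty)."""
    return sorted(problems, key=lambda p: p[difficulty_key])
```

In a training loop under these assumptions, prompts would be drawn in the order given by `curriculum_order`, and the mask from `majority_vote_rewards` would zero out the loss contribution of rollouts whose prompt produced no agreed answer.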
