New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR
This paper addresses a fundamental question about how reasoning emerges in AI systems and is most relevant to researchers in reinforcement learning and language models, although its probabilistic perspective appears incremental.
The paper tackles the debate over whether Reinforcement Learning with Verifiable Rewards (RLVR) gives LLMs new capabilities or merely elicits latent ones. It proposes that complex reasoning emerges when RLVR sharpens atomic step probabilities, which counteracts the exponential decay of success rates in multi-step chains. Empirically, RLVR incentivizes the exploration of new solution paths, composite performance correlates with the joint probability of atomic steps (Pearson ρ∈[0.69, 0.96]), and RLVR can sacrifice specific skills to maximize aggregate reward.
Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($ρ\in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
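
The exponential-decay argument in the abstract can be illustrated with a short numerical sketch. It is not the paper's code: it assumes that atomic steps succeed independently, so that the joint probability of a chain is the product of its per-step probabilities. The function names and the probabilities 0.6 and 0.95 are hypothetical values chosen for illustration only.

```python
import math


def composite_success(step_probs):
    """Joint success probability of a multi-step chain.

    Assumption (not taken from the paper): steps succeed independently,
    so the joint probability is the product of per-step probabilities.
    """
    return math.prod(step_probs)


def pass_at_k(p, k):
    """Probability that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-step success rates before and after sharpening.
chain_length = 8
base = composite_success([0.6] * chain_length)    # roughly 0.017
sharp = composite_success([0.95] * chain_length)  # roughly 0.66

print(f"base chain success:      {base:.4f}")
print(f"sharpened chain success: {sharp:.4f}")
print(f"base pass@16:            {pass_at_k(base, 16):.4f}")
print(f"sharpened pass@16:       {pass_at_k(sharp, 16):.4f}")
```

Under these assumptions, a moderate gain per step compounds across the chain: an 8-step task that the base model almost never completes becomes solvable in most attempts, which is the sense in which sharper primitives can look like a new skill at the instance level.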