SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

arXiv:2605.18864 · 93.9 · Has Code

Predicted impact: top 7% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For researchers working on reinforcement learning for LLMs, this work addresses the limitation of RLVR in exploring new reasoning modes, providing a method to improve both pass@1 and pass@k.

The paper argues that RLVR improves pass@1 but often fails to improve pass@k comparably because reverse-KL regularization anchors the policy to the reference distribution. The authors propose SAGE, which reshapes the anchor distribution via a guide function and achieves consistent improvements in both pass@1 and pass@k on mathematical reasoning benchmarks.

Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure. In this work, we argue that a central structural constraint arises from reverse-KL regularization, which stabilizes training but inherently anchors the policy to the reference distribution, thereby suppressing the emergence of alternative reasoning modes. However, we show that neither removing the KL term nor replacing it with forward-KL provides a satisfactory solution, as both disrupt the efficiency-coverage trade-off by either inducing reward hacking or allocating probability mass to off-target regions. To resolve this tension, we propose SAGE, a principled framework that enables controllable empirical support expansion by reshaping the reverse-KL anchor distribution itself through a guide function q(x,y), achieving consistent improvements in both pass@1 and pass@k across challenging mathematical reasoning benchmarks. Our code is available at https://github.com/tally0818/SAGE.
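
The anchoring effect described in the abstract can be illustrated on a toy finite response set, using the well-known closed form for the maximizer of E_pi[r] - beta * KL(pi || anchor), which is proportional to anchor * exp(r / beta). The sketch below is not the SAGE algorithm: the three "modes", the value of beta, the guide weights, and the multiplicative reshaping in `reshape_anchor` are all assumptions made for illustration, and the paper's actual guide function q(x,y) may be defined differently.

```python
import math


def kl_regularized_optimum(anchor, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || anchor) on a finite set."""
    weights = [a * math.exp(r / beta) for a, r in zip(anchor, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def reshape_anchor(anchor, guide):
    """Hypothetical reshaping (an assumption, not the paper's definition):
    multiply the anchor by per-response guide weights and renormalize."""
    weights = [a * g for a, g in zip(anchor, guide)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy setup for a single prompt: three candidate reasoning modes.
# Modes 0 and 2 are correct (reward 1); mode 1 is incorrect (reward 0).
# Mode 2 is a rare alternative that the reference policy almost never samples.
pi_ref = [0.90, 0.099, 0.001]
rewards = [1.0, 0.0, 1.0]
beta = 0.5  # illustrative KL coefficient

# Standard reverse-KL anchor: equally rewarded modes keep their reference ratio,
# so the rare correct mode stays rare.
pi_star = kl_regularized_optimum(pi_ref, rewards, beta)

# Reshaped anchor: a hypothetical guide upweights the rare mode before regularizing.
guide = [1.0, 1.0, 100.0]
pi_star_guided = kl_regularized_optimum(reshape_anchor(pi_ref, guide), rewards, beta)

print("reference      :", pi_ref)
print("reverse-KL opt :", [round(p, 4) for p in pi_star])
print("guided opt     :", [round(p, 4) for p in pi_star_guided])
```

In this toy setting, the plain reverse-KL optimum shifts mass away from the incorrect mode, but it leaves the ratio between the two correct modes exactly as it was under the reference policy. Changing the anchor, rather than dropping the KL term, is what moves mass onto the rare correct mode.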
