Self-Hinting Language Models Enhance Reinforcement Learning

Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian

arXiv:2602.03143v1 | 9 citations | h-index: 17 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a bottleneck in aligning large language models with verifiable objectives and offers an incremental improvement for reinforcement learning practitioners.

The paper tackles the stalling of Group Relative Policy Optimization (GRPO) under sparse terminal rewards. It proposes SAGE, a framework that injects privileged hints during training to increase within-group rollout diversity. Across six benchmarks, SAGE improves over GRPO by an average of +2.0, +1.2, and +1.3 points on Llama-3.2-3B-Instruct, Qwen2.5-7B-Instruct, and Qwen3-4B-Instruct, respectively.

Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $\tau$ conditioned on $(x,h)$. Crucially, the task reward $R(x,\tau)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
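
The collapse described in the abstract can be illustrated with a small sketch. It is not taken from the SAGE code base. The function names, the mean/standard-deviation normalization (a common GRPO-style choice, assumed here), the binary-reward model, and the example success probabilities are all illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group.

    Assumption: advantages are (reward - group mean) / (group std + eps),
    a common GRPO-style normalization; the paper's exact form may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def prob_degenerate_group(p, group_size):
    """Probability that every rollout in a group gets the same binary reward.

    Assumption: each rollout succeeds independently with probability p,
    so the group is all-success with p**G or all-failure with (1 - p)**G.
    """
    return p ** group_size + (1 - p) ** group_size


if __name__ == "__main__":
    # Identical rewards: every advantage is zero, so the update vanishes.
    print(group_relative_advantages([0, 0, 0, 0]))
    # Mixed rewards: the single success gets a positive advantage.
    print(group_relative_advantages([1, 0, 0, 0]))
    # Hypothetical success rates without and with hints (not from the paper).
    for p in (0.01, 0.3):
        print(p, prob_degenerate_group(p, group_size=8))
```

Under these assumptions, a prompt solved 1% of the time yields an all-identical group of 8 rollouts in roughly 92% of cases, while a 30% success rate lowers that to roughly 6%. This is the sense in which conditioning on a hint can restore a learning signal without altering the reward $R(x,\tau)$ itself.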
