AI · May 9

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

arXiv:2605.08817 · 91.8

Predicted impact: top 11% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For researchers working on LLM reasoning, this work provides a method to improve exploration in RLVR without sacrificing generation quality, addressing a known bottleneck.

The paper addresses the entropy collapse problem in reinforcement learning with verifiable rewards (RLVR) for LLM reasoning, where exploration fails to diversify successful trajectories. The proposed IMAX framework uses trainable soft prefixes and an InfoMax reward to improve exploration, achieving up to 11.60% gain in Pass@4 and 10.57% in Avg@4 over standard RLVR.
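
The two metrics quoted above can be illustrated with a small sketch. It assumes the common definitions (Pass@k: share of problems with at least one correct rollout among k samples; Avg@k: mean per-rollout accuracy); the paper's exact evaluation protocol is not given here, and the function names and toy data are hypothetical.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Share of problems with at least one correct rollout among the k sampled."""
    return sum(any(rollouts) for rollouts in results) / len(results)


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Per-rollout accuracy for each problem, averaged over problems."""
    return sum(sum(rollouts) / len(rollouts) for rollouts in results) / len(results)


# Toy data (hypothetical): 4 problems, k = 4 rollouts each.
# "collapsed": every rollout for a problem agrees, so extra samples add no coverage.
collapsed = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]
# "diverse": rollouts disagree, so more problems are solved at least once.
diverse = [
    [True, False, False, True],
    [False, True, False, False],
    [True, False, True, False],
    [False] * 4,
]

collapsed_scores = (pass_at_k(collapsed), avg_at_k(collapsed))
diverse_scores = (pass_at_k(diverse), avg_at_k(diverse))
```

In the toy data the collapsed policy has the higher Avg@4 but the lower Pass@4, which is the gap between single-rollout accuracy and coverage that the summary describes.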

Reinforcement learning with verifiable rewards (RLVR) has recently thrived in large language model (LLM) reasoning tasks. However, reward sparsity and long reasoning horizons make effective exploration challenging. In practice, this challenge manifests as the entropy collapse phenomenon, where RLVR improves single-rollout accuracy but fails to expand coverage of successful reasoning trajectories. Passive exploration techniques such as entropy regularization tend to sacrifice generation quality, resulting in noisy rollouts. In response, we propose an Information-Maximizing Augmented eXploration (IMAX) framework that trains a pool of soft prefixes to reshape the base model's prior over reasoning trajectories. Rather than relying on RL to incentivize exploration on top of the base model, each prefix acts as a trainable control knob that induces a distinct rollout distribution from the same backbone model. To encourage discovery of diverse and task-relevant reasoning behaviors, we derive an Information Maximization (InfoMax) reward that complements the verifiable rewards during RL training. IMAX is generally algorithm-agnostic and can be seamlessly integrated into existing RLVR pipelines. Experimental results show that, across three backbone scales, IMAX consistently improves reasoning performance over standard RLVR, with gains of up to 11.60% in Pass@4 and 10.57% in Avg@4.
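
The abstract does not state the form of the InfoMax reward. As a rough illustration of the general idea only, the sketch below uses a standard variational lower bound on the mutual information between a prefix index and its rollout (the style used in skill-discovery methods such as DIAYN), not the paper's derivation. The discriminator logits, the uniform prior over prefixes, the weight `beta`, and all function names are assumptions.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def infomax_bonus(disc_logits: Sequence[float], prefix_id: int) -> float:
    """Assumed bonus: log q(z | rollout) - log p(z), with p(z) uniform over K prefixes.

    disc_logits are hypothetical discriminator scores, one per prefix, for a
    single rollout. The bonus is positive when the rollout is easy to attribute
    to the prefix that produced it, and is at most log K.
    """
    num_prefixes = len(disc_logits)
    return log_softmax(disc_logits)[prefix_id] + math.log(num_prefixes)


def shaped_reward(
    verifiable_reward: float,
    disc_logits: Sequence[float],
    prefix_id: int,
    beta: float = 0.1,  # assumed trade-off weight, not taken from the paper
) -> float:
    """Verifiable reward plus a weighted diversity bonus."""
    return verifiable_reward + beta * infomax_bonus(disc_logits, prefix_id)


# A rollout the discriminator cannot attribute to any prefix earns no bonus.
indistinct = infomax_bonus([0.0, 0.0, 0.0, 0.0], prefix_id=0)
# A rollout clearly attributable to prefix 0 earns a positive bonus.
distinct = infomax_bonus([5.0, 0.0, 0.0, 0.0], prefix_id=0)
```

Under these assumptions, prefixes that all induce the same rollout distribution receive no bonus, so the extra term only pays off when each prefix steers the backbone toward recognizably different trajectories.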
