CL · Nov 6, 2025

Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models

arXiv:2511.04800v1 · 4 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in scaling RLVR for reasoning LLMs, offering an incremental improvement to enhance training efficiency and model performance.

The paper tackles residual prompts in RLVR for reasoning LLMs: prompts whose sampled responses all receive the same reward, so the reward variance is zero and the prompt contributes no training signal. As such prompts accumulate, training diversity and effectiveness drop. The proposed ERPO framework reactivates them by encouraging exploration, and the authors report consistent gains over baselines on mathematical reasoning benchmarks.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, more training prompts become residual prompts: those with zero-variance rewards that provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts that previously produced all correct responses. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive training signals. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
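
To make the mechanism concrete, here is a minimal Python sketch of the two ideas the abstract describes: why a zero-variance reward group yields no signal under GRPO-style group normalization, and how a per-prompt history tracker might raise the sampling temperature. The abstract does not give the actual update rule, so the class name `PromptHistory`, the linear schedule, and the `step` and `max_temperature` values are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


@dataclass
class PromptHistory:
    """Hypothetical tracker: counts consecutive all-correct rollout groups per prompt."""
    base_temperature: float = 1.0
    step: float = 0.1             # assumed increment per all-correct group
    max_temperature: float = 1.5  # assumed cap on the sampling temperature
    all_correct_streak: dict = field(default_factory=dict)

    def update(self, prompt_id, rewards):
        # Extend the streak if every response was correct (reward 1), else reset it.
        if all(r == 1.0 for r in rewards):
            self.all_correct_streak[prompt_id] = self.all_correct_streak.get(prompt_id, 0) + 1
        else:
            self.all_correct_streak[prompt_id] = 0

    def temperature(self, prompt_id):
        # Assumed linear schedule: hotter sampling the longer the prompt stays all-correct.
        streak = self.all_correct_streak.get(prompt_id, 0)
        return min(self.base_temperature + self.step * streak, self.max_temperature)


# A residual prompt: identical rewards give all-zero advantages, hence no gradient signal.
residual = group_advantages([1.0, 1.0, 1.0, 1.0])

# Two all-correct groups in a row raise the temperature for that prompt only.
tracker = PromptHistory()
tracker.update("p1", [1.0, 1.0, 1.0, 1.0])
tracker.update("p1", [1.0, 1.0, 1.0, 1.0])
tracker.update("p2", [1.0, 0.0, 1.0, 0.0])
```

In this sketch a prompt returns to the base temperature as soon as a group contains an incorrect response, since its rewards then have nonzero variance and it contributes signal again. Whether ERPO resets, decays, or holds the temperature in that case is not stated in the abstract.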

