AI · Nov 6, 2025

RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

arXiv:2511.04285v1 · 1 citation · h-index: 21
Originality: Incremental advance
AI Analysis

This paper addresses generalization issues in reinforcement learning for large reasoning models and represents an incremental improvement over existing methods.

The paper tackles overfitting in Reinforcement Learning for Verifiable Rewards (RLVR) by proposing RLoop, a self-improving framework built on iterative policy initialization, which boosts average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals this is driven by policy over-specialization and catastrophic forgetting of diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. This dataset is used via Rejection-sampling Fine-Tuning (RFT) to refine the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
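
The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a policy that is just a vector of logits over a handful of candidate answers, uses a plain REINFORCE update as a stand-in for whatever RL algorithm the paper uses, and treats any positive reward as a "successful trajectory". The names `rl_explore`, `rft_update`, and `rloop`, and all hyperparameters, are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rl_explore(logits, reward_fn, rng, steps=20, samples=8, lr=0.5):
    """Toy REINFORCE run (assumed stand-in for the RL phase).

    Returns every successful action sampled at ANY step, so solutions found
    by intermediate policies are kept rather than discarded.
    """
    logits = list(logits)  # copy: the initial policy itself is left untouched
    n = len(logits)
    successes = []
    for _ in range(steps):
        probs = softmax(logits)
        actions = rng.choices(range(n), weights=probs, k=samples)
        rewards = [reward_fn(a) for a in actions]
        baseline = sum(rewards) / samples
        for a, r in zip(actions, rewards):
            if r > 0:
                successes.append(a)  # filter: keep only verified successes
            adv = r - baseline
            for i in range(n):
                # gradient of log softmax w.r.t. logit i: one_hot(a)[i] - probs[i]
                grad = (1.0 if i == a else 0.0) - probs[i]
                logits[i] += lr * adv * grad / samples
    return successes


def rft_update(logits, expert_actions, epochs=50, lr=0.5):
    """Rejection-sampling fine-tuning, sketched as maximum likelihood
    on the filtered successes, starting from the INITIAL policy."""
    logits = list(logits)
    if not expert_actions:
        return logits
    n = len(logits)
    target = [0.0] * n
    for a in expert_actions:
        target[a] += 1.0 / len(expert_actions)
    for _ in range(epochs):
        probs = softmax(logits)
        for i in range(n):
            # gradient of mean log-likelihood w.r.t. logit i
            logits[i] += lr * (target[i] - probs[i])
    return logits


def rloop(initial_logits, reward_fn, iterations=3, seed=0):
    """Iterate: explore with RL, filter successes, refine the starting policy."""
    rng = random.Random(seed)
    policy = list(initial_logits)
    for _ in range(iterations):
        expert_data = rl_explore(policy, reward_fn, rng)
        policy = rft_update(policy, expert_data)  # RL-trained weights are dropped
    return policy
```

For example, calling `rloop([0.0] * 8, lambda a: 1.0 if a in (2, 5) else 0.0)` starts from a uniform policy over eight answers, two of which are rewarded, and returns refined logits. In this sketch the RL-trained logits are thrown away after each exploration phase, and only the filtered dataset carries information forward, which is the re-initialization idea the abstract describes.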
