CL · AI · LG · Jun 11, 2025

RePO: Replay-Enhanced Policy Optimization

arXiv:2506.09340v1 · 27 citations · h-index: 9 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses efficiency issues in RL for LLMs, offering a domain-specific improvement for optimizing models in tasks like mathematical reasoning.

The paper tackles the high computational cost and low data efficiency of Group Relative Policy Optimization (GRPO) in reinforcement learning for large language models. It introduces Replay-Enhanced Policy Optimization (RePO), which uses replay strategies to retrieve off-policy samples from a replay buffer. Compared to GRPO, RePO yields absolute average performance gains of 18.4 points for Qwen2.5-Math-1.5B and 4.1 points for Qwen3-1.7B on mathematical reasoning benchmarks.

Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.
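
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes the standard GRPO-style advantage, which normalizes each reward by the mean and standard deviation of its group, and a hypothetical per-prompt replay buffer with a single "most recent" retrieval strategy. The names `ReplayBuffer`, `retrieve_recent`, and `group_relative_advantages` are illustrative, and pooling on-policy and off-policy rewards into one group is an assumption made here for simplicity.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


class ReplayBuffer:
    """Hypothetical per-prompt buffer of past (response, reward) pairs."""

    def __init__(self, capacity_per_prompt=64):
        # Each prompt keeps a bounded queue; oldest samples are evicted first.
        self._store = defaultdict(lambda: deque(maxlen=capacity_per_prompt))

    def add(self, prompt, response, reward):
        self._store[prompt].append((response, reward))

    def retrieve_recent(self, prompt, k):
        """One illustrative replay strategy: the k most recent samples."""
        items = list(self._store.get(prompt, ()))
        return items[-k:] if k > 0 else []


# Off-policy samples stored from earlier policy versions (toy data).
buffer = ReplayBuffer()
buffer.add("2+2=?", "5", 0.0)
buffer.add("2+2=?", "4", 1.0)

# Fresh on-policy samples for the same prompt (toy data).
on_policy = [("4", 1.0), ("22", 0.0)]
off_policy = buffer.retrieve_recent("2+2=?", k=2)

# Assumption: advantages are computed over the pooled group of samples.
rewards = [reward for _, reward in on_policy + off_policy]
advantages = group_relative_advantages(rewards)
```

Reusing stored samples enlarges the group for each prompt without generating new rollouts, which is the efficiency argument the abstract makes. A real off-policy update would also need to account for the gap between the current policy and the policy that produced the stored samples; that part is omitted here.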

Code Implementations: 1 repo
