LG · AI · CL · Apr 18, 2025

Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

CMU · Stanford
arXiv:2504.13818v3 · 80 citations · h-index: 26
Originality: Incremental advance
AI Analysis

This work addresses an efficiency bottleneck in RL for LLMs and offers a practical speedup for researchers and practitioners. It is an incremental advance, since it builds on existing RLVR frameworks.

The paper tackles the compute and memory asymmetry in reinforcement learning with verifiable rewards (RLVR) for large language models by introducing PODS, which trains on only a selected subset of the generated rollouts. This reduces update costs while maintaining learning quality: GRPO with PODS reaches the peak test accuracy of vanilla GRPO at least 1.7x faster.

Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
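
The max-variance down-sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a maximum-variance subset of size m can always be formed from the k lowest and m - k highest rewards for some k, so that one sort plus prefix sums gives the stated $O(n\log n)$ cost. The function name `max_variance_subset` and its return format are hypothetical.

```python
from itertools import accumulate


def max_variance_subset(rewards, m):
    """Pick m of n rollouts whose rewards have maximal (population) variance.

    Assumption: an optimal subset consists of the k lowest and the m - k
    highest rewards for some k in 0..m, so only m + 1 candidates are scored.
    Returns (indices into `rewards`, variance of the chosen rewards).
    """
    n = len(rewards)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")

    # Sort rollout indices by reward: the O(n log n) step.
    order = sorted(range(n), key=lambda i: rewards[i])
    r = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards give O(1) candidate scoring.
    s1 = [0.0] + list(accumulate(r))
    s2 = [0.0] + list(accumulate(x * x for x in r))

    best_var, best_k = -1.0, 0
    for k in range(m + 1):
        hi = m - k  # number of rollouts taken from the high-reward end
        total = s1[k] + (s1[n] - s1[n - hi])
        total_sq = s2[k] + (s2[n] - s2[n - hi])
        var = total_sq / m - (total / m) ** 2
        if var > best_var:
            best_var, best_k = var, k

    chosen = order[:best_k] + order[n - (m - best_k):]
    return chosen, best_var


# Example: 8 rollouts with binary verifiable rewards, keep 4 for the update.
indices, variance = max_variance_subset([0, 0, 0, 1, 1, 0, 0, 1], m=4)
```

With binary rewards, as in the example, the sketch keeps an even mix of correct and incorrect rollouts where one is available, which is the "reward diversity" the abstract refers to. In a PODS-style loop the returned indices would select which rollouts enter the policy update; that wiring is omitted here.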

