LG · Apr 9

Efficient RL Training for LLMs with Experience Replay

arXiv:2604.08706 · 68.31 citations · h-index: 13
AI Analysis

For practitioners training LLMs with RL, this provides a method to reduce computational costs while maintaining or improving model quality.

This work challenges the assumption that on-policy data is essential for LLM post-training, showing that a well-designed replay buffer can reduce inference compute by reusing rollouts without degrading performance, and in some cases even improving it.

While Experience Replay - the practice of storing rollouts and reusing them multiple times during training - is a foundational technique in general RL, it remains largely unexplored in LLM post-training due to the prevailing belief that fresh, on-policy data is essential for high performance. In this work, we challenge this assumption. We present a systematic study of replay buffers for LLM post-training, formalizing the optimal design as a trade-off between staleness-induced variance, sample diversity and the high computational cost of generation. We show that strict on-policy sampling is suboptimal when generation is expensive. Empirically, we show that a well-designed replay buffer can drastically reduce inference compute without degrading - and in some cases even improving - final model performance, while preserving policy entropy.
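To make the idea of storing rollouts and reusing them concrete, here is a minimal sketch of a replay buffer for LLM rollouts. It is not the design studied in the paper: the names `Rollout` and `ReplayBuffer` and the parameters `capacity`, `max_staleness`, and `max_uses` are hypothetical, chosen only to illustrate the trade-off between staleness, diversity, and generation cost that the abstract describes.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float
    policy_version: int  # training step of the policy that generated it
    uses: int = 0        # how many times it has been sampled for training


class ReplayBuffer:
    """Illustrative FIFO buffer (assumed design, not the paper's)."""

    def __init__(self, capacity, max_staleness, max_uses, seed=0):
        self.items = deque(maxlen=capacity)  # oldest entries fall off first
        self.max_staleness = max_staleness   # allowed policy-version lag
        self.max_uses = max_uses             # cap on reuse per rollout
        self.rng = random.Random(seed)

    def add(self, rollouts):
        """Store freshly generated rollouts."""
        self.items.extend(rollouts)

    def _evict(self, current_version):
        """Drop rollouts that are too stale or have been reused too often."""
        self.items = deque(
            (r for r in self.items
             if current_version - r.policy_version <= self.max_staleness
             and r.uses < self.max_uses),
            maxlen=self.items.maxlen,
        )

    def sample(self, batch_size, current_version):
        """Return up to batch_size rollouts, sampled uniformly without replacement."""
        self._evict(current_version)
        k = min(batch_size, len(self.items))
        batch = self.rng.sample(list(self.items), k)
        for r in batch:
            r.uses += 1
        return batch


if __name__ == "__main__":
    buf = ReplayBuffer(capacity=8, max_staleness=2, max_uses=2)
    buf.add([Rollout("p", f"c{i}", 1.0, policy_version=0) for i in range(4)])
    print(len(buf.sample(batch_size=4, current_version=1)))
```

In this sketch, raising `max_uses` or `max_staleness` saves generation compute at the price of training on older, more off-policy data, while a larger `capacity` keeps more diverse samples available. How to set these values well is the question the paper studies; the numbers used here are arbitrary.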

