LG | AI | CL | ML | Oct 5, 2025

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

arXiv:2510.04072v2 | 4 citations | h-index: 21
Originality: Incremental advance
AI Analysis

The paper addresses inefficiencies in reinforcement learning training for LLM reasoning, offering a plug-compatible method with incremental improvements over existing approaches.

The paper tackles unstable updates and inefficient exploration in on-policy reinforcement learning for large language model reasoning. It introduces Slow-Fast Policy Optimization (SFPO), which improves training stability, outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks, and needs up to 4.93× fewer rollouts to match GRPO's best accuracy.

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also requires up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy.
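The three stages named in the abstract can be pictured with a small toy sketch. The abstract does not give the update rules, so everything below is an assumption made for illustration: the function name `sfpo_style_step`, the parameters `lr`, `fast_steps`, and `alpha`, the choice of plain gradient descent on a quadratic loss (a real pipeline would ascend a policy objective), and in particular the reposition rule, which is modeled here as a linear interpolation back toward the starting parameters. It is not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def sfpo_style_step(
    theta: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.1,
    fast_steps: int = 3,
    alpha: float = 0.5,
) -> Vector:
    """One slow-fast update on a single fixed batch (illustrative assumption)."""
    start = list(theta)

    # Stage 1 (fast trajectory): several inner gradient steps, all using
    # the same batch, represented here by the same grad_fn.
    fast = list(start)
    for _ in range(fast_steps):
        grad = grad_fn(fast)
        fast = [p - lr * g for p, g in zip(fast, grad)]

    # Stage 2 (reposition): pull the fast endpoint back toward the start to
    # limit drift. Linear interpolation is an assumed rule; alpha = 0
    # returns to the start, alpha = 1 keeps the fast endpoint.
    repositioned = [s + alpha * (f - s) for s, f in zip(start, fast)]

    # Stage 3 (slow correction): one more gradient step from the
    # repositioned point.
    grad = grad_fn(repositioned)
    return [p - lr * g for p, g in zip(repositioned, grad)]


def toy_grad(theta: Vector) -> Vector:
    """Gradient of the toy loss 0.5 * ||theta - target||^2, target = (1, -2)."""
    target = [1.0, -2.0]
    return [p - t for p, t in zip(theta, target)]


def toy_loss(theta: Vector) -> float:
    """Toy loss matching toy_grad."""
    target = [1.0, -2.0]
    return 0.5 * sum((p - t) ** 2 for p, t in zip(theta, target))


if __name__ == "__main__":
    theta = [0.0, 0.0]
    for step in range(5):
        theta = sfpo_style_step(theta, toy_grad)
        print(step, [round(p, 4) for p in theta], round(toy_loss(theta), 4))
```

In this sketch, setting `alpha=0` discards the fast trajectory entirely, so the update reduces to a single ordinary gradient step from the starting point. That mirrors the abstract's plug-compatibility claim only loosely: the objective and the batch are untouched, and only the path taken between rollouts changes.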

