Group Sequence Policy Optimization
This addresses the challenge of stable and efficient RL training for large language models, particularly for Mixture-of-Experts architectures, though it appears incremental as it builds on existing RL algorithms.
The paper tackles the problem of training large language models with reinforcement learning by introducing Group Sequence Policy Optimization (GSPO), which uses sequence-level operations to achieve superior efficiency and performance compared to GRPO, notably stabilizing Mixture-of-Experts training and contributing to improvements in Qwen3 models.
This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.