It Takes Two: Your GRPO Is Secretly DPO
This work targets the computational cost of post-training Large Language Models (LLMs), enabling researchers and practitioners to train faster with minimal performance loss.
The paper challenges the assumption that Group Relative Policy Optimization (GRPO) requires large group sizes for stable training. By reframing GRPO as contrastive learning and linking it to Direct Preference Optimization (DPO), it shows that a minimal two-rollout configuration (2-GRPO) achieves performance comparable to 16-GRPO while using only 1/8 of the rollouts and reducing training time by over 70%.
Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
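To make the two-rollout case concrete, the following is a minimal sketch of the standard group-relative advantage (each reward minus the group mean, divided by the group standard deviation) applied to a group of size 2. It is an illustration, not the paper's implementation: the function name `group_relative_advantages`, the `eps` stabilizer, the choice of population standard deviation, and the binary 0/1 rewards are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumption: population standard deviation is used; some implementations
    use the sample standard deviation, which rescales the result.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts with different (hypothetical) rewards: one is pushed up,
# the other pushed down, by roughly equal and opposite amounts.
pair_mixed = group_relative_advantages([1.0, 0.0])

# Two rollouts with identical rewards: both advantages are zero,
# so this pair contributes no learning signal.
pair_tied = group_relative_advantages([1.0, 1.0])

# Rollout budget of 2-GRPO relative to 16-GRPO.
rollout_fraction = 2 / 16

print(pair_mixed, pair_tied, rollout_fraction)
```

Under these assumptions, a pair with differing rewards yields advantages close to +1 and -1, so one response acts as the preferred sample and the other as the rejected one. This is one way to read the pairwise, DPO-like structure the abstract describes. A tied pair yields zero advantages, and the rollout fraction is 2/16 = 1/8, matching the figure quoted above.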