AI · LG · Dec 8, 2025

Comparative Analysis and Parametric Tuning of PPO, GRPO, and DAPO for LLM Reasoning Enhancement

arXiv:2512.07611v1 · 3 citations · h-index: 2
Originality: Incremental advance
AI Analysis

The paper offers incremental, practical guidance on RL-based LLM training for AI researchers and practitioners who want to improve reasoning performance.

This study compared three RL algorithms (PPO, GRPO, DAPO) for enhancing LLM reasoning, finding that RL-trained models outperformed base models across benchmarks, with DAPO performing best when its Dynamic Sampling component was disabled.

This study presents a systematic comparison of three Reinforcement Learning (RL) algorithms (PPO, GRPO, and DAPO) for improving complex reasoning in large language models (LLMs). Our main contribution is a controlled transfer-learning evaluation: models are first fine-tuned on the specialized Countdown Game and then assessed on a suite of general-purpose reasoning benchmarks. Across all tasks, RL-trained models outperform their corresponding base models, although the degree of improvement differs by benchmark. Our parametric analysis offers practical guidance for RL-based LLM training. Increasing the group size in GRPO and DAPO leads to more stable training dynamics and higher accuracy, while the impact of the KL-penalty coefficient is non-monotonic. Additionally, we find that the Dynamic Sampling (DS) component in DAPO does not improve performance; in fact, the best overall results are achieved with DAPO when DS is disabled.
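To make the group-size and Dynamic Sampling findings concrete, here is a minimal sketch of the two mechanisms as they are commonly described: GRPO-style advantages normalized within a group of sampled responses, and a DAPO-style filter that drops groups whose rewards are all identical, since such groups yield zero advantage. The function names, the `eps` constant, the use of population standard deviation, and the toy rewards are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population std plus a small eps; real implementations may
    use the sample std or a different stabilizer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """Dynamic-Sampling-style check: keep a group only if its rewards differ.

    A group where every response is right (or every response is wrong)
    produces all-zero advantages and so contributes no gradient.
    """
    return max(rewards) != min(rewards)


# Hypothetical binary rewards for three prompts, four samples per prompt.
groups = {
    "prompt_a": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: informative
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # all wrong: no signal
    "prompt_c": [1.0, 1.0, 1.0, 1.0],  # all right: no signal
}

# With the filter on, only groups with mixed outcomes are kept.
kept = {
    prompt: group_relative_advantages(rewards)
    for prompt, rewards in groups.items()
    if has_learning_signal(rewards)
}

# With the filter off, uniform groups stay in the batch with zero advantages.
unfiltered = {p: group_relative_advantages(r) for p, r in groups.items()}
```

In this sketch, a larger group gives a less noisy estimate of the per-prompt mean and standard deviation, which is one plausible reading of why the study sees more stable training as group size grows. Disabling the filter, as in the best-performing DAPO configuration reported above, corresponds to using `unfiltered` instead of `kept`.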

