Superseded baseline · #10 of 772 most-superseded
GRPO
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
LLM reasoning / chain-of-thought · first seen Feb 5, 2024
superseded — cited as a baseline and beaten by newer methods
5 papers critique it · 0 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites GRPO as a baseline.
“methods based on RLVR predominantly rely on GRPO, which provides only coarse, outcome-level supervision and lacks the fine-grained signals necessary for improving complex, step-by-step reasoning”
— Improving Vision-language Models with Perception-centric Process Reward Models
“online RL algorithms represented by GRPO often lead to unsatisfactory training results due to insignificant differences in reward signals”
— ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
“This introduces the credit assignment problem, as the scalar signal at step T fails to distinguish the contribution of each token s_t.”
— Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models
“Inadequate exploration, as independent sampling strategy struggles to produce diverse trajectories with sufficient exploration due to structural inefficiency”
— Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization
“reward design is intrinsically noisy. Unlike multi-hop search with near-unique references, many practical tool-use tasks admit multiple valid outputs (e.g., recommendations). As a result, outcome-only rewards induce high-variance gradients and provide weak incentives for reasoning, even when augmented with LLM-as-a-judge or learned reward models”
— ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
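Several of the critiques above concern the same mechanism: GRPO's outcome-level advantage. The following is a minimal sketch, assuming the outcome-supervision variant in which each sampled response's scalar reward is normalized by its group's mean and standard deviation and then applied to every token of that response. The function names, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not taken from any paper's released code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's scalar reward against its sampling group.

    rewards: one outcome reward per sampled response to the same prompt.
    eps: hypothetical stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Outcome supervision: every token in a response shares one advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Mixed outcomes: correct and incorrect responses get opposite-sign advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Identical outcomes: every advantage is zero, so the group yields no gradient.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Each token inherits its response's advantage, regardless of its contribution.
per_token = broadcast_to_tokens([1.0, -1.0], token_counts=[3, 2])
```

Under these assumptions, `flat` illustrates the "insignificant differences in reward signals" critique, and `per_token` illustrates the credit assignment critique: the sketch has no way to tell a helpful token from a harmful one within the same response.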
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 27, 2026
- Apr 27, 2026
- Jan 14, 2026