GRPO (LLM reasoning / chain-of-thought): superseded, meaning it is cited as a baseline and beaten by newer methods. 5 papers critique it and 0 beat it on benchmarks; it ranks #10 of 772 most-superseded. Sub-problem: the cluster led by GRPO. Newer alternatives in the same sub-problem include IB-TPO, Perceval, and CoT-Flow.

Superseded baseline · #10 of 772 most-superseded

GRPO

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

LLM reasoning / chain-of-thought · first seen Feb 5, 2024

superseded — cited as a baseline and beaten by newer methods

5 papers critique it · 0 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites GRPO as a baseline.

“methods based on RLVR predominantly rely on GRPO, which provides only coarse, outcome-level supervision and lacks the fine-grained signals necessary for improving complex, step-by-step reasoning”
— Improving Vision-language Models with Perception-centric Process Reward Models
“online RL algorithms represented by GRPO often lead to unsatisfactory training results due to insignificant differences in reward signals”
— ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
“This introduces the credit assignment problem, as the scalar signal at step T fails to distinguish the contribution of each token s_t.”
— Efficient Paths and Dense Rewards: Probabilistic Flow Reasoning for Large Language Models
“Inadequate exploration, as independent sampling strategy struggles to produce diverse trajectories with sufficient exploration due to structural inefficiency”
— Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization
“reward design is intrinsically noisy. Unlike multi-hop search with near-unique references, many practical tool-use tasks admit multiple valid outputs (e.g., recommendations). As a result, outcome-only rewards induce high-variance gradients and provide weak incentives for reasoning, even when augmented with LLM-as-a-judge or learned reward models”
— ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
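
Several of the critiques above point at the same property: GRPO scores each sampled response with one scalar reward, normalizes it against the other responses in its group, and applies the result to every token of that response. The sketch below illustrates this under stated assumptions. It follows the outcome-supervision form described in the DeepSeekMath paper, but the function names, the toy rewards, the use of population standard deviation, and the `eps` term are illustrative choices, not taken from any of the cited papers or from a specific library.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each response's scalar reward against its sampling group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations may differ on both points.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, token_counts):
    """Outcome-level supervision: copy one advantage to every token."""
    return [[adv] * n for adv, n in zip(advantages, token_counts)]


# Hypothetical group of four responses to one prompt, rewarded 1.0 if the
# final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
token_counts = [3, 5, 2, 4]  # hypothetical response lengths in tokens

advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, token_counts)

# Every token inside one response carries the same value, so the signal
# cannot say which step of the reasoning helped or hurt.
for response_advantages in per_token:
    print(response_advantages)

# If all responses in a group earn the same reward, every advantage is zero
# and the group contributes no gradient.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Read this way, the "coarse, outcome-level supervision" and credit-assignment complaints correspond to the broadcast step, and the "insignificant differences in reward signals" complaint corresponds to groups whose rewards are nearly identical, as in the `flat` case.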

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.