Math-Shepherd
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
LLM reasoning / chain-of-thought · first seen Dec 14, 2023
heavily superseded — a standard baseline that newer methods routinely beat
5 papers critique it · 8 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Math-Shepherd as a baseline.
“However, its reliance on large-scale sampling makes it computationally expensive and primarily confined to mathematical reasoning.”
— Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs
“Math-Shepherd proposes an automated method for estimating intermediate step correctness using Monte Carlo estimation, though it generates some incorrect labels and demands extensive computational resources.”
— Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
“training PRMs requires stepwise human annotation, which is often infeasible for open-source communities”
— Entropy-Regularized Process Reward Model
“these approaches are typically operated and optimized based on the target policy and focus on finding the first error location, which can restrict the versatility of the PRM in evaluating a wide range of policies, experience performance degradation when applied to out-of-distribution (OOD) policies, and reduce usability in optimizing subsequent RL algorithms that require complete process rewards along the trajectory”
— AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
“Although this strategy is effective, it remains both computationally expensive and indirect.”
— Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
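The Monte Carlo estimation that the critiques above refer to can be illustrated with a minimal sketch. It is not the Math-Shepherd implementation: `estimate_step_quality`, `complete`, and `num_rollouts` are hypothetical names, and `complete` stands in for sampling one full solution from a language model given the steps so far. The "soft" and "hard" labels follow the way this kind of estimation is commonly described (fraction of rollouts that reach the reference answer, and whether any rollout reaches it).

```python
from typing import Callable, Dict, List


def estimate_step_quality(
    prefix: List[str],
    complete: Callable[[List[str]], str],
    gold_answer: str,
    num_rollouts: int = 8,
) -> Dict[str, float]:
    """Estimate how promising a partial solution is by sampling completions.

    prefix       -- the reasoning steps written so far (ending at the step being labeled)
    complete     -- hypothetical sampler: finishes the solution and returns a final answer
    gold_answer  -- the reference final answer for the problem
    num_rollouts -- how many completions to sample from this prefix
    """
    if num_rollouts <= 0:
        raise ValueError("num_rollouts must be positive")

    # Count the rollouts whose final answer matches the reference answer.
    hits = sum(complete(prefix) == gold_answer for _ in range(num_rollouts))

    return {
        "soft": hits / num_rollouts,      # fraction of rollouts that succeed
        "hard": 1.0 if hits > 0 else 0.0,  # 1 if at least one rollout succeeds
    }
```

Under these assumptions, labeling every step of a solution needs `num_rollouts` sampled completions per step, which is consistent with the cost concern raised in the quotes. A step can also receive a wrong label when a rollout happens to reach the right answer despite an earlier mistake, or fails despite a correct prefix.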
Beaten on benchmarks
Head-to-head results where a newer method reports beating Math-Shepherd. Values are copied from the source paper's tables — verify against the cited paper.
- Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning
  Process Reward Models for Reflective Mathematical Reasoning beats Math-Shepherd:
  - PRM@64 [MATH500]: 0.816 vs 0.778
  - PRM@8-step [MATH500]: 0.750 vs 0.702
  - PRM@8-step [AIME2024]: 0.167 vs 0.100
  - F1 [Step-Level Testset]: 0.828 vs 0.523
- Adversarial Training for Process Reward Models
  \shortname{} beats Math-Shepherd:
  - Avg. [GPT-OSS-120B]: 83.0 vs 78.3
  - Avg. [GPT-OSS-20B]: 85.0 vs 71.5
  - Avg. [Gemma-3-27B]: 85.2 vs 66.0
  - Avg. [Gemma-3-12B]: 58.6 vs 53.0
- The Lessons of Developing Process Reward Models in Mathematical Reasoning
  Qwen2.5-Math-PRM-7B beats Math-Shepherd:
  - Avg. [Best-of-8, Qwen2.5-Math-7B-Instruct, product scoring]: 67.6 vs 64.2
  - Avg. F1 [ProcessBench, 7B+ PRMs]: 73.5 vs 31.5
- Towards Robust Process Reward Modeling via Noise-aware Learning
  Qwen2.5-Math-7B-NAIT beats Math-Shepherd:
  - Avg. F1 [MCE-based PRMs]: 57.4 vs 31.5
  - Avg. [policy model Qwen2.5-14B-Instruct]: 75.9 vs 71.9
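One of the settings above is labeled "Best-of-8 ... product scoring". As a rough illustration of what such a setup typically means, the sketch below scores each candidate solution by the product of its per-step PRM scores and returns the answer of the highest-scoring candidate. The function names and the example scores are hypothetical, and the cited papers may differ in details such as tie-breaking or how step scores are produced.

```python
import math
from typing import List, Sequence, Tuple


def product_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in [0, 1] into one solution score by multiplying them."""
    return math.prod(step_scores)


def best_of_n(candidates: Sequence[Tuple[str, List[float]]]) -> str:
    """Return the final answer of the candidate with the highest product score.

    Each candidate is (final_answer, per_step_scores); the scores here are
    assumed to come from some process reward model.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    best_answer, _ = max(candidates, key=lambda c: product_score(c[1]))
    return best_answer


if __name__ == "__main__":
    # Made-up scores: one weak step drags candidate "A" below candidate "B".
    sampled = [
        ("A", [0.9, 0.9, 0.2]),
        ("B", [0.8, 0.8, 0.8]),
    ]
    print(best_of_n(sampled))
```

With product aggregation a single low-scoring step sharply lowers the whole solution's score, so the selection favors candidates with no weak steps.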
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 3, 2026
- May 2, 2026
- Apr 19, 2026
- DC-W2S · DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning · Mar 9, 2026
- Feb 9, 2026
- Jan 29, 2026
- Noise-Aware Iterative Training (NAIT) · Towards Robust Process Reward Modeling via Noise-aware Learning · Jan 19, 2026
- GroundedPRM · GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning · Oct 16, 2025
- group-relative advantage reinforcement learning · Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA · Sep 30, 2025