Is Math-Shepherd superseded?

Math-Shepherd (LLM reasoning / chain-of-thought) is heavily superseded: a standard baseline that newer methods routinely beat. 5 papers critique it and 8 beat it on benchmarks, which ranks it #5 of 772 most-superseded. Sub-problem: cluster led by ORM. Newer alternatives in the same sub-problem include SCI-PRM, GR-Ben, MedPRMBench, DC-W2S, CoTZero.

Heavily superseded · #5 of 772 most-superseded

Math-Shepherd

Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

LLM reasoning / chain-of-thought · first seen Dec 14, 2023

heavily superseded — a standard baseline that newer methods routinely beat

5 papers critique it · 8 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Math-Shepherd as a baseline.

“However, its reliance on large-scale sampling makes it computationally expensive and primarily confined to mathematical reasoning.”
— Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs
“Math-Shepherd proposes an automated method for estimating intermediate step correctness using Monte Carlo estimation, though it generates some incorrect labels and demands extensive computational resources.”
— Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
“training PRMs requires stepwise human annotation, which is often infeasible for open-source communities”
— Entropy-Regularized Process Reward Model
“these approaches are typically operated and optimized based on the target policy and focus on finding the first error location, which can restrict the versatility of the PRM in evaluating a wide range of policies, experience performance degradation when applied to out-of-distribution (OOD) policies, and reduce usability in optimizing subsequent RL algorithms that require complete process rewards along the trajectory”
— AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
“Although this strategy is effective, it remains both computationally expensive and indirect.”
— Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
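
Several of the critiques above target the Monte Carlo estimation of step correctness and its sampling cost. A minimal sketch of that idea follows, assuming a hypothetical `rollout` callable that samples one completion from a reasoning prefix and reports whether the final answer is correct. The function names, the toy completer, and the default of 8 rollouts are illustrative assumptions, not the authors' code.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: given a prefix of reasoning steps and an RNG,
# sample ONE completion and return True if its final answer is correct.
Rollout = Callable[[Sequence[str], random.Random], bool]


def mc_step_labels(
    steps: Sequence[str],
    rollout: Rollout,
    n_rollouts: int = 8,
    seed: int = 0,
) -> List[Tuple[float, int]]:
    """Estimate a (soft, hard) label for every step of one solution.

    soft = fraction of sampled completions that reach a correct answer
    hard = 1 if at least one sampled completion is correct, else 0
    """
    rng = random.Random(seed)
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        # Each step costs n_rollouts completions, so a full solution
        # costs len(steps) * n_rollouts calls to the completer.
        hits = sum(rollout(prefix, rng) for _ in range(n_rollouts))
        labels.append((hits / n_rollouts, 1 if hits > 0 else 0))
    return labels


def toy_rollout(prefix: Sequence[str], rng: random.Random) -> bool:
    """Toy stand-in for an LLM completer (an assumption for this demo).

    Once the prefix contains a step tagged "[bad]", no completion recovers;
    otherwise a completion succeeds 90% of the time.
    """
    if any("[bad]" in step for step in prefix):
        return False
    return rng.random() < 0.9


if __name__ == "__main__":
    solution = ["x = 3 + 4", "y = x * 2 [bad]", "answer = y"]
    for step, (soft, hard) in zip(solution, mc_step_labels(solution, toy_rollout)):
        print(f"{step!r}: soft={soft:.2f} hard={hard}")
```

The loop makes the cost visible: the number of sampled completions grows with both solution length and rollouts per step. Labels are also noisy by construction, since a correct step can draw only failing completions and a flawed step can still reach the right answer by chance.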

Beaten on benchmarks

Head-to-head results where a newer method reports beating Math-Shepherd. Values are copied from the source paper's tables — verify against the cited paper.

Process Reward Models for Reflective Mathematical Reasoning beats Math-Shepherd · PRM@64 [MATH500]
0.816 vs 0.778
Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning
Process Reward Models for Reflective Mathematical Reasoning beats Math-Shepherd · PRM@8-step [MATH500]
0.750 vs 0.702
Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning
Process Reward Models for Reflective Mathematical Reasoning beats Math-Shepherd · PRM@8-step [AIME2024]
0.167 vs 0.100
Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning
Process Reward Models for Reflective Mathematical Reasoning beats Math-Shepherd · F1 [Step-Level Testset]
0.828 vs 0.523
Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning
\shortname{} beats Math-Shepherd · Avg. [GPT-OSS-120B]
83.0 vs 78.3
Adversarial Training for Process Reward Models
\shortname{} beats Math-Shepherd · Avg. [GPT-OSS-20B]
85.0 vs 71.5
Adversarial Training for Process Reward Models
\shortname{} beats Math-Shepherd · Avg. [Gemma-3-27B]
85.2 vs 66.0
Adversarial Training for Process Reward Models
\shortname{} beats Math-Shepherd · Avg. [Gemma-3-12B]
58.6 vs 53.0
Adversarial Training for Process Reward Models
Qwen2.5-Math-PRM-7B beats Math-Shepherd · Avg. [Best-of-8, Qwen2.5-Math-7B-Instruct, product scoring]
67.6 vs 64.2
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Qwen2.5-Math-PRM-7B beats Math-Shepherd · Avg. F1 [ProcessBench, 7B+ PRMs]
73.5 vs 31.5
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Qwen2.5-Math-7B-NAIT beats Math-Shepherd · Avg. F1 [MCE-based PRMs]
57.4 vs 31.5
Towards Robust Process Reward Modeling via Noise-aware Learning
Qwen2.5-Math-7B-NAIT beats Math-Shepherd · Avg. [policy model Qwen2.5-14B-Instruct]
75.9 vs 71.9
Towards Robust Process Reward Modeling via Noise-aware Learning
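
Some of the settings above refer to Best-of-N selection with product scoring. The page does not describe the procedure, so the following is only a sketch of one common reading: a PRM assigns each step a score in [0, 1], the step scores of a candidate are aggregated into one number, and the highest-scoring candidate is kept. The function names and the example scores are hypothetical.

```python
import math
from typing import List, Sequence


def aggregate(step_scores: Sequence[float], how: str = "product") -> float:
    """Collapse per-step PRM scores into a single solution-level score."""
    if not step_scores:
        raise ValueError("a candidate needs at least one step score")
    if how == "product":
        return math.prod(step_scores)
    if how == "min":
        return min(step_scores)
    if how == "last":
        return step_scores[-1]
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: Sequence[Sequence[float]], how: str = "product") -> int:
    """Return the index of the candidate with the highest aggregated score."""
    scores: List[float] = [aggregate(c, how) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Made-up step scores for three sampled solutions to one problem.
    candidates = [
        [0.90, 0.90, 0.90],
        [0.99, 0.99, 0.50],
        [0.80, 0.95, 0.97],
    ]
    for how in ("product", "min", "last"):
        print(how, "->", best_of_n(candidates, how))
```

The aggregation choice can change which candidate wins, which is one reason reported Best-of-N numbers are only comparable when the scoring rule and the policy model match.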

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.