Superseded baseline · #20 of 772 most-superseded
VersaPRM
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
LLM reasoning / chain-of-thought · first seen Feb 10, 2025
superseded — cited as a baseline and beaten by newer methods
2 papers critique it · 1 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites VersaPRM as a baseline.
“However, it still draws on an annotation procedure that employs an LLM as a step-level judge, making it potentially error-prone, and uses only three labels (good, neutral, bad).”
— Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards
“there exists a substantial performance gap between these PRMs error-identification capability in general reasoning domains and that in mathematical domains”
— GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
Beaten on benchmarks
Head-to-head results where a newer method reports beating VersaPRM. Values are copied from the source paper's tables — verify against the cited paper.
- DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
BoN w/ Full Set beats VersaPRM · Average F1 [All cell lines]
68.50 (BoN w/ Full Set) vs 57.16 (VersaPRM)
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 3, 2026
- May 2, 2026
- Apr 19, 2026
- DC-W2S · DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning · Mar 9, 2026
- Feb 9, 2026
- Jan 29, 2026
- Noise-Aware Iterative Training (NAIT) · Towards Robust Process Reward Modeling via Noise-aware Learning · Jan 19, 2026
- GroundedPRM · GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning · Oct 16, 2025
- group-relative advantage reinforcement learning · Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA · Sep 30, 2025