ORM
LLM reasoning / chain-of-thought
heavily superseded — a standard baseline that newer methods routinely beat
10 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ORM as a baseline.
“Early alignment efforts relied on Outcome Reward Models (ORM)… where feedback is concentrated solely on the final solution; however, this approach faces a severe credit assignment problem”
— CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
“Process Reward Models (PRMs) enable fine-grained step-level supervision for model reasoning, addressing traditional Outcome Reward Models (ORMs) limitation of only scoring final outputs.”
— CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models
“This granularity mismatch leads to inconsistent scoring between partial and complete sequences.”
— From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
“Existing works wang2024mathshepherdverifyreinforcellms, lightman2023letsverifystepstep show that PRM outperforms ORM by a considerable margin”
— Entropy-Regularized Process Reward Model
“most of recent work demonstrates that ORMs fall short on complex multi-step reasoning tasks”
— Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
“ORM evaluates the whole reasoning process based on the final answer, ignoring intermediate steps”
— AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
“Traditional Outcome Reward Models (ORMs)...which assign rewards based solely on the final answer, fail to detect flawed intermediate reasoning.”
— MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
“Outcome Reward Models (ORMs) focus solely on evaluating final solutions for correctness, ignoring process optimality and therefore missing costly but non-terminal inefficiencies”
— When Agents go Astray: Course-Correcting SWE Agents with PRMs
“ORM evaluates the entire reasoning path by assigning a single score to the final solution, whereas PRM provides step-level scores, yielding denser reward signals and generally outperforming ORM.”
— Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
“this approach, which provides a sparse reward risks inadvertently validating reasoning trajectories that are flawed, illogical, or factually incorrect, so long as they coincidentally arrive at the correct final output”
— DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
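The critiques above share one point: an ORM attaches a single score to the final answer, while a PRM scores each step. The sketch below illustrates that contrast on a toy example. It is not taken from any of the cited papers; the function names and the `[flawed]` marker convention are hypothetical, and `toy_step_scorer` is a stand-in for a learned step-level model.

```python
from typing import Callable, List


def orm_rewards(steps: List[str], final_correct: bool) -> List[float]:
    """Outcome-style reward: one score for the whole solution,
    placed on the last step. Earlier steps get no signal (sparse reward)."""
    if not steps:
        return []
    return [0.0] * (len(steps) - 1) + [1.0 if final_correct else 0.0]


def prm_rewards(steps: List[str], step_scorer: Callable[[str], float]) -> List[float]:
    """Process-style reward: every step gets its own score (dense reward)."""
    return [step_scorer(step) for step in steps]


def toy_step_scorer(step: str) -> float:
    # Hypothetical stand-in for a learned PRM: steps tagged "[flawed]"
    # score 0.0, everything else scores 1.0.
    return 0.0 if "[flawed]" in step else 1.0


# A trajectory with a wrong intermediate step that still reaches the right answer.
steps = [
    "Let x be the number of apples.",
    "[flawed] 3 * 4 = 14, so x = 14 - 2.",
    "Therefore x = 12.",
]

orm = orm_rewards(steps, final_correct=True)
prm = prm_rewards(steps, toy_step_scorer)

print("ORM:", orm)
print("PRM:", prm)
```

Under these assumptions the outcome-style reward gives the flawed trajectory full credit at the final step, whereas the step-level reward flags the second step. This is the "coincidentally correct" failure mode the last critique describes.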
Beaten on benchmarks
Head-to-head results where a newer method reports beating ORM. Values are copied from the source paper's tables — verify against the cited paper.
- SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
Sci-PRM beats ORM · accuracy [Msearth with RL training]
53.72 vs 52.12
- SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
Sci-PRM beats ORM · accuracy [BioProBench-ERR with RL training]
69.63 vs 68.74
- SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
Sci-PRM beats ORM · accuracy [ChemBench with RL training]
68.90 vs 65.18
- SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
Sci-PRM beats ORM · accuracy [Mol-Instructions with RL training]
54.62 vs 40.23
- ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
ReST-RL beats ORM · Average [Qwen3-8B]
0.689 vs 0.531
- ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
ReST-RL beats ORM · Average [Qwen2.5-Coder-7B-Instruct]
0.673 vs 0.592
- ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
ReST-RL beats ORM · Average [DS-Coder-6.7b-Instruct]
0.584 vs 0.542
- ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
ReST-RL beats ORM · Average [OpenCI-DS-6.7B]
0.583 vs 0.537
- Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
MCQ-ORM beats ORM · P-Acc [Llama3 1B AQuA]
70.0 vs 67.3
- Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
MCQ-ORM beats ORM · P-Acc [Llama3 1B MATH]
65.7 vs 64.5
- Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
MCQ-ORM beats ORM · P-Acc [Llama3 1B GSM8K]
88.7 vs 87.8
- Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
MCQ-ORM beats ORM · P-Acc [Llama3 8B AQuA]
92.0 vs 90.3
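To compare the size of the reported gains, the head-to-head pairs above can be turned into absolute margins. The sketch below copies the values listed in this section as they appear here (they still need checking against the cited papers). The tuple layout and the `margins` and `margin_range` helpers are illustrative assumptions. Margins are only compared within one method, since the three papers report different metrics on different scales.

```python
from typing import Dict, List, Tuple

# (method, setting, newer method's score, ORM score), as listed in this section.
rows: List[Tuple[str, str, float, float]] = [
    ("Sci-PRM", "Msearth", 53.72, 52.12),
    ("Sci-PRM", "BioProBench-ERR", 69.63, 68.74),
    ("Sci-PRM", "ChemBench", 68.90, 65.18),
    ("Sci-PRM", "Mol-Instructions", 54.62, 40.23),
    ("ReST-RL", "Qwen3-8B", 0.689, 0.531),
    ("ReST-RL", "Qwen2.5-Coder-7B-Instruct", 0.673, 0.592),
    ("ReST-RL", "DS-Coder-6.7b-Instruct", 0.584, 0.542),
    ("ReST-RL", "OpenCI-DS-6.7B", 0.583, 0.537),
    ("MCQ-ORM", "Llama3 1B AQuA", 70.0, 67.3),
    ("MCQ-ORM", "Llama3 1B MATH", 65.7, 64.5),
    ("MCQ-ORM", "Llama3 1B GSM8K", 88.7, 87.8),
    ("MCQ-ORM", "Llama3 8B AQuA", 92.0, 90.3),
]


def margins(data: List[Tuple[str, str, float, float]]) -> List[Tuple[str, str, float]]:
    """Absolute gain over ORM per row, rounded to 3 decimals to hide float noise."""
    return [(method, setting, round(new - orm, 3)) for method, setting, new, orm in data]


def margin_range(data: List[Tuple[str, str, float, float]]) -> Dict[str, Tuple[float, float]]:
    """Smallest and largest gain per method (no aggregation across methods)."""
    grouped: Dict[str, List[float]] = {}
    for method, _setting, gain in margins(data):
        grouped.setdefault(method, []).append(gain)
    return {method: (min(gains), max(gains)) for method, gains in grouped.items()}


for method, (low, high) in margin_range(rows).items():
    print(f"{method}: gains over ORM range from {low} to {high}")
```

The per-method grouping is a deliberate choice: averaging a 0 to 1 score with a percentage accuracy would produce a meaningless number.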
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Jun 3, 2026
- May 2, 2026
- Apr 19, 2026
- DC-W2S · DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning · Mar 9, 2026
- Feb 9, 2026
- Jan 29, 2026
- Noise-Aware Iterative Training (NAIT) · Towards Robust Process Reward Modeling via Noise-aware Learning · Jan 19, 2026
- GroundedPRM · GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning · Oct 16, 2025
- group-relative advantage reinforcement learning · Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA · Sep 30, 2025