ORM (LLM reasoning / chain-of-thought): heavily superseded, a standard baseline that newer methods routinely beat. 10 papers critique it and 4 beat it on benchmarks, which ranks it #4 of 772 most-superseded. Sub-problem: cluster led by ORM. Newer alternatives in the same sub-problem include SCI-PRM, GR-Ben, MedPRMBench, DC-W2S, CoTZero.

Heavily superseded · #4 of 772 most-superseded

ORM

LLM reasoning / chain-of-thought

heavily superseded — a standard baseline that newer methods routinely beat

10 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites ORM as a baseline.

“Early alignment efforts relied on Outcome Reward Models (ORM)… where feedback is concentrated solely on the final solution; however, this approach faces a severe credit assignment problem”
— CoTZero: Annotation-Free Human-Like Vision Reasoning via Hierarchical Synthetic CoT
“Process Reward Models (PRMs) enable fine-grained step-level supervision for model reasoning, addressing traditional Outcome Reward Models (ORMs) limitation of only scoring final outputs.”
— CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models
“This granularity mismatch leads to inconsistent scoring between partial and complete sequences.”
— From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
“Existing works [Wang et al., 2024; Lightman et al., 2023] show that PRM outperforms ORM by a considerable margin”
— Entropy-Regularized Process Reward Model
“most of recent work demonstrates that ORMs fall short on complex multi-step reasoning tasks”
— Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
“ORM evaluates the whole reasoning process based on the final answer, ignoring intermediate steps”
— AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
“Traditional Outcome Reward Models (ORMs)...which assign rewards based solely on the final answer, fail to detect flawed intermediate reasoning.”
— MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision
“Outcome Reward Models (ORMs) focus solely on evaluating final solutions for correctness, ignoring process optimality and therefore missing costly but non-terminal inefficiencies”
— When Agents go Astray: Course-Correcting SWE Agents with PRMs
“ORM evaluates the entire reasoning path by assigning a single score to the final solution, whereas PRM provides step-level scores, yielding denser reward signals and generally outperforming ORM.”
— Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
“this approach, which provides a sparse reward risks inadvertently validating reasoning trajectories that are flawed, illogical, or factually incorrect, so long as they coincidentally arrive at the correct final output”
— DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
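
The distinction these critiques draw can be illustrated with a toy sketch. It is not taken from any of the cited papers. It assumes hand-written step labels (`is_valid`) in place of a learned step scorer, exact string match in place of a learned outcome scorer, and min-aggregation as one possible way to collapse step scores into a trace score. All names (`Step`, `Trace`, `outcome_score`, `process_score`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    is_valid: bool  # assumption: hand label; a real PRM would predict this


@dataclass
class Trace:
    steps: List[Step]
    final_answer: str


def outcome_score(trace: Trace, reference: str) -> float:
    """ORM-style: a single score for the whole trace, from the final answer only."""
    return 1.0 if trace.final_answer.strip() == reference.strip() else 0.0


def process_scores(trace: Trace) -> List[float]:
    """PRM-style: one score per reasoning step."""
    return [1.0 if step.is_valid else 0.0 for step in trace.steps]


def process_score(trace: Trace) -> float:
    """Collapse step scores with min (assumption; product is another common choice)."""
    scores = process_scores(trace)
    return min(scores) if scores else 0.0


# Two traces for "12 * 3 + 4" that reach the same final answer.
sound = Trace([Step("12 * 3 = 36", True), Step("36 + 4 = 40", True)], "40")
lucky = Trace([Step("12 * 3 = 38", False), Step("38 + 4 = 40", False)], "40")

if __name__ == "__main__":
    for name, trace in [("sound", sound), ("lucky", lucky)]:
        print(name, outcome_score(trace, "40"), process_scores(trace), process_score(trace))
```

Both traces receive the same outcome score because they end in the same answer, while only the sound trace keeps a nonzero score under step-level scoring. This is the failure mode the quotes describe: flawed intermediate reasoning that happens to reach the correct final output.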

Beaten on benchmarks

Head-to-head results where a newer method reports beating ORM. Each pair appears to list the newer method's score first and ORM's second. Values are copied from the source paper's tables; verify against the cited paper.

Sci-PRM beats ORM · accuracy, with RL training
SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
Msearth · 53.72 vs 52.12
BioProBench-ERR · 69.63 vs 68.74
ChemBench · 68.90 vs 65.18
Mol-Instructions · 54.62 vs 40.23
ReST-RL beats ORM · Average
ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
Qwen3-8B · 0.689 vs 0.531
Qwen2.5-Coder-7B-Instruct · 0.673 vs 0.592
DS-Coder-6.7b-Instruct · 0.584 vs 0.542
OpenCI-DS-6.7B · 0.583 vs 0.537
MCQ-ORM beats ORM · P-Acc
Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA
Llama3 1B AQuA · 70.0 vs 67.3
Llama3 1B MATH · 65.7 vs 64.5
Llama3 1B GSM8K · 88.7 vs 87.8
Llama3 8B AQuA · 92.0 vs 90.3
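
The margins in the results listed above can be tabulated with a short script. This is a rough sketch under two assumptions: the first number in each pair is the newer method's score and the second is ORM's, and higher is better for every metric. Because the papers report on different scales (percentages and 0 to 1 fractions) and different metrics, the relative gain is only a convenience for scanning the list, not a like-for-like comparison across papers. The names `RESULTS`, `relative_gain`, and `ranked_by_relative_gain` are hypothetical.

```python
# (method, setting, newer-method score, ORM score), copied from the listing above.
RESULTS = [
    ("Sci-PRM", "Msearth", 53.72, 52.12),
    ("Sci-PRM", "BioProBench-ERR", 69.63, 68.74),
    ("Sci-PRM", "ChemBench", 68.90, 65.18),
    ("Sci-PRM", "Mol-Instructions", 54.62, 40.23),
    ("ReST-RL", "Qwen3-8B", 0.689, 0.531),
    ("ReST-RL", "Qwen2.5-Coder-7B-Instruct", 0.673, 0.592),
    ("ReST-RL", "DS-Coder-6.7b-Instruct", 0.584, 0.542),
    ("ReST-RL", "OpenCI-DS-6.7B", 0.583, 0.537),
    ("MCQ-ORM", "Llama3 1B AQuA", 70.0, 67.3),
    ("MCQ-ORM", "Llama3 1B MATH", 65.7, 64.5),
    ("MCQ-ORM", "Llama3 1B GSM8K", 88.7, 87.8),
    ("MCQ-ORM", "Llama3 8B AQuA", 92.0, 90.3),
]


def relative_gain(new: float, orm: float) -> float:
    """Gain of the newer method over ORM, as a percentage of the ORM score."""
    return 100.0 * (new - orm) / orm


def ranked_by_relative_gain(results):
    """Sort rows from largest to smallest relative gain over ORM."""
    return sorted(results, key=lambda row: relative_gain(row[2], row[3]), reverse=True)


if __name__ == "__main__":
    for method, setting, new, orm in ranked_by_relative_gain(RESULTS):
        print(f"{method:8s} {setting:28s} "
              f"{new - orm:+8.3f} abs  {relative_gain(new, orm):+6.2f}% rel")
```

Sorting by relative rather than absolute gain keeps the ReST-RL rows, which are reported as fractions, from being dwarfed by the percentage-scale rows.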

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.