PRIME
LLM reasoning / chain-of-thought
superseded — cited as a baseline and beaten by newer methods
2 papers critique it · 1 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites PRIME as a baseline.
“this specific format only holds under certain assumptions”
— rePIRL: Learn PRM with Inverse RL for LLM Reasoning
“PRIME differs crucially by: [(a)] Employing per-token rewards derived from log-likelihood ratios, which reward-guided generation literatures (discrete GANs, human preference modeling, generation quality evaluation etc) suggests is much less effective than our holistic step-wise discriminators.”
— Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
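For readers unfamiliar with the phrase "per-token rewards derived from log-likelihood ratios" in the second quote, here is a minimal sketch of what such a reward can look like. It assumes the common form r_t = beta * (log pi(y_t | context) - log pi_ref(y_t | context)); the function name, the beta value, and the input numbers are hypothetical illustrations and are not taken from the PRIME code or from either citing paper.

```python
from typing import List


def per_token_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    beta: float = 0.5,  # hypothetical scaling factor, chosen for illustration
) -> List[float]:
    """Return one reward per token as a scaled log-likelihood ratio.

    Assumed form: r_t = beta * (log pi(y_t | ...) - log pi_ref(y_t | ...)).
    Inputs are the log-probabilities each model assigns to the tokens
    that were actually generated, in order.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same token sequence")
    return [beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]


# Hypothetical log-probabilities for a three-token response.
policy = [-0.5, -1.0, -0.25]
reference = [-1.0, -1.0, -0.75]

rewards = per_token_rewards(policy, reference)
print(rewards)       # one reward per token
print(sum(rewards))  # sequence-level total of the token rewards
```

A token is rewarded when the trained model assigns it a higher probability than the reference model does, and gets zero reward when the two agree. The critique quoted above contrasts this token-level signal with discriminators that score a whole reasoning step at once.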
Beaten on benchmarks
Head-to-head results where a newer method reports beating PRIME. Values are copied from the source paper's tables — verify against the cited paper.
- rePIRL: Learn PRM with Inverse RL for LLM Reasoning
rePIRL beats PRIME · Math Avg. [Qwen2.5-3B-Instruct Math]
33.5 vs 29.9
- rePIRL: Learn PRM with Inverse RL for LLM Reasoning
rePIRL beats PRIME · Coding Avg. [Qwen2.5-3B-Instruct Coding]
27.7 vs 25.2
- rePIRL: Learn PRM with Inverse RL for LLM Reasoning
rePIRL beats PRIME · Math Avg. [Qwen3-4B-Base Math]
41.6 vs 39.3
- rePIRL: Learn PRM with Inverse RL for LLM Reasoning
rePIRL beats PRIME · Coding Avg. [Qwen3-4B-Base Coding]
39.8 vs 31.1
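The size of each gap can be read off with a few lines of arithmetic. The sketch below only subtracts the figures listed above; the variable and function names are hypothetical, and the scores themselves still need to be verified against the cited paper.

```python
# (benchmark, model, rePIRL score, PRIME score), copied from the entries above.
results = [
    ("Math Avg.", "Qwen2.5-3B-Instruct", 33.5, 29.9),
    ("Coding Avg.", "Qwen2.5-3B-Instruct", 27.7, 25.2),
    ("Math Avg.", "Qwen3-4B-Base", 41.6, 39.3),
    ("Coding Avg.", "Qwen3-4B-Base", 39.8, 31.1),
]


def margins(rows):
    """Return (benchmark, model, rePIRL minus PRIME) rounded to one decimal."""
    return [(bench, model, round(new - old, 1)) for bench, model, new, old in rows]


for bench, model, gap in margins(results):
    print(f"{bench} [{model}]: {gap:+.1f} points")
```

On these figures the largest reported gap is on Coding Avg. with Qwen3-4B-Base, and the smallest is on Math Avg. with the same model.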
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 19, 2026
- May 4, 2026