Best-of-N
LLM reasoning / chain-of-thought
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 4 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Best-of-N as a baseline.
“the far simpler method of majority voting wang2023selfconsistency, which completely ignores the expensive PRM and relies solely on the consensus of the LLM's own generations, can outperform PRM-guided BoN.”
— Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
“restricting PRM evaluation to complete CoTs misses opportunities for dynamic guidance”
— Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
“In all of these methods the guidance signal---critique, scoring model, abstraction prompt---is produced by an untrained, prompted LLM.”
— Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
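The contrast drawn in the first critique, between selecting the single highest-scored sample (Best-of-N) and taking the most common answer (majority voting), can be illustrated with a minimal sketch. The sample answers, the scores, and the function names `best_of_n` and `majority_vote` are hypothetical and are not taken from any of the cited papers; the scores merely stand in for what a reward model might assign to each complete sample.

```python
from collections import Counter

# Hypothetical toy data: (final_answer, reward_score) for N sampled
# chains of thought. The scores are made up for illustration and stand in
# for a reward model's score on each complete sample.
samples = [
    ("42", 0.61),
    ("42", 0.58),
    ("17", 0.93),
    ("42", 0.55),
    ("8", 0.20),
]


def best_of_n(samples):
    """Return the answer of the single highest-scored sample."""
    answer, _score = max(samples, key=lambda s: s[1])
    return answer


def majority_vote(samples):
    """Return the most frequent answer, ignoring the scores entirely."""
    counts = Counter(answer for answer, _score in samples)
    return counts.most_common(1)[0][0]


bon_answer = best_of_n(samples)
mv_answer = majority_vote(samples)
print("Best-of-N:", bon_answer)
print("Majority vote:", mv_answer)
```

On this toy data the two rules disagree: one confidently scored outlier decides the Best-of-N answer, while majority voting follows the consensus of the samples. Which rule is right in practice depends on how reliable the scorer is, which is the point the critique raises.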
Beaten on benchmarks
Head-to-head results where a newer method reports beating Best-of-N. Values are copied from the source paper's tables — verify against the cited paper.
- Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
Logit WV beats Best-of-N · Average [Qwen-PRM800K-7B]
57.6 vs 52.7
- Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
Logit WV beats Best-of-N · Average [Qwen-PRM-7B]
63.3 vs 61.8
- Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
Logit WV beats Best-of-N · Average [Llama3.1-Mistral-8B]
62.6 vs 58.3
- Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
Logit WV beats Best-of-N · Average [Llama3.1-DS-8B]
61.5 vs 58.9
- Know What You Don't Know: Uncertainty Calibration of Process Reward Models
BoN+IAS w/ Calib. PRM beats Best-of-N · Budget Ratio [MATH500 / Llama-3.2-1B]
0.6381 vs 1.0
- Know What You Don't Know: Uncertainty Calibration of Process Reward Models
BoN+IAS w/ Calib. PRM beats Best-of-N · Budget Ratio [MATH500 / Qwen-2.5-7B]
0.2342 vs 1.0
- Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Sequential beats Best-of-N · Bias [GPT-4o-mini, English]
0.916 vs 0.573
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Best-of-N · DeepResearchBench Average [Qwen3-8B]
34.01 vs 33.27
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Best-of-N · SQA-CS-V2 Average [Qwen3-8B]
74.80 vs 70.08
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Best-of-N · DeepResearchBench Average [Qwen3-14B]
36.92 vs 33.19
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Best-of-N · SQA-CS-V2 Average [Qwen3-14B]
76.05 vs 70.81
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 27, 2026
- Tree-of-Thoughts · Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns · May 27, 2026
- May 22, 2026
- May 22, 2026
- Novelty-based Tree-of-Thought Search · Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning · May 7, 2026
- Decoding-Time Debiasing via Process Reward Models · Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation · May 4, 2026
- Apr 27, 2026
- Apr 22, 2026
- CoT-PoT ensembling · Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning · Apr 19, 2026
- Atropos · Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap · Apr 16, 2026
- Apr 1, 2026
- Learning When to Sample · Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning · Mar 17, 2026