Is Best-of-N superseded?

Best-of-N (LLM reasoning / chain-of-thought) is classed as superseded: it is cited as a baseline and beaten by newer methods. 3 papers critique it and 4 report beating it on benchmarks, which ranks it #8 of 772 most-superseded. Its sub-problem is the cluster led by Chain-of-Thought. Newer alternatives in the same sub-problem include Marginal Sharpening, Tree-of-Thoughts, Co-ReAct, MA-CoT, and Novelty-based Tree-of-Thought Search.

Superseded baseline · #8 of 772 most-superseded

Best-of-N

LLM reasoning / chain-of-thought

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 4 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Best-of-N as a baseline.

“the far simpler method of majority voting [Wang et al., 2023], which completely ignores the expensive PRM and relies solely on the consensus of the LLM's own generations, can outperform PRM-guided BoN.”
— Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
“restricting PRM evaluation to complete CoTs misses opportunities for dynamic guidance”
— Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
“In all of these methods the guidance signal (critique, scoring model, abstraction prompt) is produced by an untrained, prompted LLM.”
— Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
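
The first critique above contrasts two ways of choosing among N sampled answers: taking the one a process reward model (PRM) scores highest, or taking the most common one. The following is a minimal Python sketch of both, assuming each sample has already been reduced to a final-answer string and a single scalar score. The function names, the sample answers, and the scores are hypothetical illustrations, not code or data from any of the cited papers.

```python
from collections import Counter
from typing import List


def best_of_n(answers: List[str], scores: List[float]) -> str:
    """Return the candidate whose reward score is highest (Best-of-N)."""
    if not answers or len(answers) != len(scores):
        raise ValueError("answers and scores must be non-empty and the same length")
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer, ignoring reward scores entirely."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


# Hypothetical data: five sampled final answers and made-up PRM scores.
answers = ["42", "41", "42", "42", "17"]
scores = [0.61, 0.93, 0.58, 0.64, 0.20]

print("Best-of-N picks:", best_of_n(answers, scores))
print("Majority vote picks:", majority_vote(answers))
```

With these made-up numbers, Best-of-N returns "41" because a single outlier sample carries the top score, while majority voting returns "42". This is only meant to show how the two selection rules can disagree; it says nothing about which is right on real data.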

Beaten on benchmarks

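Head-to-head results where a newer method reports beating Best-of-N. Each entry gives the newer method's value first and the Best-of-N value second. Metric direction varies: for Budget Ratio, the winning value as listed is the lower one. Values are copied from the source papers' tables; verify them against the cited paper.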

Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling
Logit WV beats Best-of-N · Average [Qwen-PRM800K-7B] · 57.6 vs 52.7
Logit WV beats Best-of-N · Average [Qwen-PRM-7B] · 63.3 vs 61.8
Logit WV beats Best-of-N · Average [Llama3.1-Mistral-8B] · 62.6 vs 58.3
Logit WV beats Best-of-N · Average [Llama3.1-DS-8B] · 61.5 vs 58.9

Know What You Don't Know: Uncertainty Calibration of Process Reward Models
BoN+IAS w/ Calib. PRM beats Best-of-N · Budget Ratio [MATH500 / Llama-3.2-1B] · 0.6381 vs 1.0
BoN+IAS w/ Calib. PRM beats Best-of-N · Budget Ratio [MATH500 / Qwen-2.5-7B] · 0.2342 vs 1.0

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation
Sequential beats Best-of-N · Bias [GPT-4o-mini, English] · 0.916 vs 0.573

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Best-of-N · DeepResearchBench Average [Qwen3-8B] · 34.01 vs 33.27
Co-ReAct beats Best-of-N · SQA-CS-V2 Average [Qwen3-8B] · 74.80 vs 70.08
Co-ReAct beats Best-of-N · DeepResearchBench Average [Qwen3-14B] · 36.92 vs 33.19
Co-ReAct beats Best-of-N · SQA-CS-V2 Average [Qwen3-14B] · 76.05 vs 70.81
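
To make the size of the reported gaps easier to read, the four Logit WV "Average" entries above can be turned into absolute margins. This is a small arithmetic sketch in Python. The dictionary name and the `margins` helper are hypothetical, the values are the ones listed on this page, and it assumes that a higher Average is better, which the "beats" label suggests but this page does not state.

```python
from typing import Dict, Tuple

# (Logit WV, Best-of-N) Average values as listed above, keyed by setting.
logit_wv_vs_bon: Dict[str, Tuple[float, float]] = {
    "Qwen-PRM800K-7B": (57.6, 52.7),
    "Qwen-PRM-7B": (63.3, 61.8),
    "Llama3.1-Mistral-8B": (62.6, 58.3),
    "Llama3.1-DS-8B": (61.5, 58.9),
}


def margins(results: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Absolute gap (newer method minus Best-of-N), rounded to one decimal."""
    return {
        setting: round(newer - baseline, 1)
        for setting, (newer, baseline) in results.items()
    }


for setting, gap in margins(logit_wv_vs_bon).items():
    print(f"{setting}: +{gap}")
```

The gaps range from 1.5 to 4.9, so the reported advantage depends noticeably on the setting in brackets.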

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.