ToT
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
LLM reasoning / chain-of-thought · first seen May 17, 2023
heavily superseded — a standard baseline that newer methods routinely beat
9 papers critique it · 5 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites ToT as a baseline.
“ToT implementations perform a predetermined number of reasoning steps, which can lead to ‘overthinking’”
— STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

“This deliberate structure improves performance over standard Chain-of-Thought (CoT) prompting but incurs high computational cost.”
— Dual-Track CoT: Budget-Aware Stepwise Guidance for Small LMs

“The critical bottleneck in this workflow lies with the state evaluator. In the original ToT work, the evaluator relies on expensive LLM self-reflection, which involves prompting the model to critique its own outputs. This introduces substantial computational overhead, making the process impractical for many applications.”
— Domain-Specialized Tree of Thought through Plug-and-Play Predictors

“we demonstrate that CoT and its reasoning variants (e.g., ToT, ReAct) consistently underperform direct answering by a significant margin”
— The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning

“reasoning trees exhibit intractable branching factors and depth, while PRMs may fail to accurately evaluate intermediate steps”
— Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs

“In all of these methods the guidance signal---critique, scoring model, abstraction prompt---is produced by an untrained, prompted LLM.”
— Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

“Attempts to remedy this through more complicated methods such as Tree of Thoughts (ToT) suffer from drawbacks such as high computation cost. In ToT specifically, the cost stems from branching "thoughts" that lead to exponential runtime and token usage during the graph search.”
— Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

“ToT generates multiple leaf nodes as potential answers, but without a verifier, it is unclear which leaf node should be selected as the final solution.”
— BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving

“Tree-of-Thought (ToT) performs hierarchical branching but may suffer from exponential growth”
— Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
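
Several of the critiques above attribute ToT's cost to branching and exponential growth. The sketch below is a rough illustration of that arithmetic, not a model of any specific ToT implementation: it counts how many thought nodes get generated when every node is expanded, versus when only a fixed number of states survive each level. The function names, the branching factor of 5, and the beam width of 5 are assumptions chosen for illustration.

```python
def full_tree_nodes(branching: int, depth: int) -> int:
    """Nodes generated when every thought is expanded at every level.

    The root (the problem statement) is not counted.
    """
    return sum(branching ** level for level in range(1, depth + 1))


def beam_search_nodes(branching: int, depth: int, beam_width: int) -> int:
    """Nodes generated when at most `beam_width` states survive each level.

    Hypothetical pruning rule for illustration: after each level, keep
    at most `beam_width` of the generated states and expand only those.
    """
    frontier = 1  # start from the single root state
    total = 0
    for _ in range(depth):
        generated = frontier * branching
        total += generated
        frontier = min(beam_width, generated)
    return total


if __name__ == "__main__":
    # Assumed settings: 5 candidate thoughts per state, beam of 5.
    for depth in (2, 4, 6):
        exhaustive = full_tree_nodes(5, depth)
        pruned = beam_search_nodes(5, depth, beam_width=5)
        print(f"depth={depth}: exhaustive={exhaustive}, pruned={pruned}")
```

Under these assumptions the exhaustive count grows geometrically with depth, while the pruned count grows linearly once the beam is full. If every generated node also needs at least one evaluator call, as the state-evaluator critique suggests, model calls scale with the same counts.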
Beaten on benchmarks
Head-to-head results where a newer method reports beating ToT. Values are copied from the source paper's tables — verify against the cited paper.
- Adversarial Training for Process Reward Models (\shortname{}, the paper's own method, beats ToT)
  - Avg. [GPT-OSS-120B]: 83.0 vs 79.9
  - Avg. [GPT-OSS-20B]: 85.0 vs 80.7
  - Avg. [Gemma-3-27B]: 85.2 vs 80.2
  - Avg. [Gemma-3-12B]: 58.6 vs 55.3
  - Overall [GPT-OSS-120B]: 64.0 vs 61.4
  - Overall [GPT-OSS-20B]: 63.0 vs 61.7
  - Overall [Gemma-3-27B-IT]: 63.5 vs 61.1
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (ReST-MCTS beats ToT)
  - Ave. [GLM4]: 16.77 vs 15.82
  - Ave. [GPT-3.5-turbo]: 10.06 vs 8.44
  - Ave. [LLaMA2-13B-Chat]: 2.90 vs 2.37
  - Ave. [GPT-3.5-Turbo]: 62.31 vs 61.06
- The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning (Direct beats ToT)
  - Acc (%) [All models]: 17.11 vs 8.85
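
To make the size of these gaps easier to read, the sketch below computes the absolute and relative margin for each pair listed above. The values are copied from this page; the shortened labels and the `margin` helper are illustrative assumptions, and the first paper's method is labeled "unnamed" because its name is not resolved in the source.

```python
# (paper, newer method, setting, newer score, ToT score), copied from the list above.
RESULTS = [
    ("Adversarial Training for PRMs", "unnamed", "Avg. [GPT-OSS-120B]", 83.0, 79.9),
    ("Adversarial Training for PRMs", "unnamed", "Avg. [GPT-OSS-20B]", 85.0, 80.7),
    ("Adversarial Training for PRMs", "unnamed", "Avg. [Gemma-3-27B]", 85.2, 80.2),
    ("Adversarial Training for PRMs", "unnamed", "Avg. [Gemma-3-12B]", 58.6, 55.3),
    ("Adversarial Training for PRMs", "unnamed", "Overall [GPT-OSS-120B]", 64.0, 61.4),
    ("Adversarial Training for PRMs", "unnamed", "Overall [GPT-OSS-20B]", 63.0, 61.7),
    ("Adversarial Training for PRMs", "unnamed", "Overall [Gemma-3-27B-IT]", 63.5, 61.1),
    ("ReST-MCTS*", "ReST-MCTS", "Ave. [GLM4]", 16.77, 15.82),
    ("ReST-MCTS*", "ReST-MCTS", "Ave. [GPT-3.5-turbo]", 10.06, 8.44),
    ("ReST-MCTS*", "ReST-MCTS", "Ave. [LLaMA2-13B-Chat]", 2.90, 2.37),
    ("ReST-MCTS*", "ReST-MCTS", "Ave. [GPT-3.5-Turbo]", 62.31, 61.06),
    ("The Curse of CoT", "Direct", "Acc (%) [All models]", 17.11, 8.85),
]


def margin(newer: float, tot: float) -> tuple[float, float]:
    """Return (absolute gap, gap as a percentage of the ToT score)."""
    gap = newer - tot
    return round(gap, 2), round(100 * gap / tot, 1)


if __name__ == "__main__":
    for paper, method, setting, newer, tot in RESULTS:
        absolute, relative = margin(newer, tot)
        print(f"{paper} | {method} | {setting}: +{absolute} ({relative}% over ToT)")
```

The rows come from different papers, metrics, and models, so the margins are only meaningful within a single row; ranking methods against each other by these numbers would not be sound.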
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 27, 2026
- Tree-of-Thoughts · Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns · May 27, 2026
- May 22, 2026
- May 22, 2026
- Novelty-based Tree-of-Thought Search · Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning · May 7, 2026
- Decoding-Time Debiasing via Process Reward Models · Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation · May 4, 2026
- Apr 27, 2026
- Apr 22, 2026
- CoT-PoT ensembling · Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning · Apr 19, 2026
- Atropos · Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap · Apr 16, 2026
- Apr 1, 2026
- Learning When to Sample · Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning · Mar 17, 2026