Chain-of-Thought
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models · LLM reasoning / chain-of-thought · first seen Jan 28, 2022
heavily superseded — a standard baseline that newer methods routinely beat
25 papers critique it · 13 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Chain-of-Thought as a baseline.
“errors can propagate through the reasoning chain, and there is no principled mechanism to revisit decisions or explore alternative strategies.”
— STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
“However, despite these advances, LLMs frequently exhibit limitations in their logical consistency, accuracy, and self-correction abilities when confronted with highly intricate or chaotic reasoning problems”
— Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection
“However, CoT is fundamentally linear: once a reasoning step is generated, the model commits to it, often propagating early errors into final failures.”
— Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns
“Studies have shown that simple approaches like CoT are inadequate for tasks that demand decomposition into sub-tasks”
— GOT4Rec: Graph of Thoughts for Sequential Recommendation
“[sprague2025to] reported that on the Massive Multitask Language Understanding (MMLU) benchmark [hendrycks2021measuring], 95% of the performance gain from CoT is attributed to questions involving symbolic reasoning.”
— Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
“Nevertheless, CoT alone does not guarantee the factual correctness of the underlying statements within the reasoning chain.”
— Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification
“Chain-of-Thought (CoT) prompting [wei2022chain, zhang2023multimodal, lyu2023faithful], while enhancing reasoning, may not ensure its steps visually align with the image and can be sensitive to setup or resource-intensive.”
— MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
“Although CoT has led to remarkable achievements, it does not always provide positive outcomes and sometimes hinders reasoning performance”
— Unveiling and Causalizing CoT: A Causal Pespective
“it still relies on a relatively simple, linear flow of thought, which can become limiting for tasks involving more complex reasoning”
— AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
“In this paper, we reveal a strikingly counterintuitive finding: Chain-of-Thought prompting unexpectedly degrades LLM performances in certain problem-solving contexts.”
— The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
“this capability typically relies on explicit supervision (annotated CoT data), is confined to a discrete token space, and is ultimately capped by the base model's pre-trained capabilities”
— Pretraining with Token-Level Adaptive Latent Chain-of-Thought
“CoT's linear structure fails to capture the branching nature of mathematical reasoning, where multiple strategies are considered, partial arguments explored, and errors necessitate backtracking”
— Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
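Several of the critiques above make the same structural point: a linear chain commits to each step, so an early mistake carries through to the final answer. A toy calculation can illustrate why this hurts on long chains. It assumes, purely for illustration, that every step is independently correct with the same probability `p` and that a single wrong step makes the final answer wrong. Neither assumption comes from the cited papers, and `chain_success` is a hypothetical helper name.

```python
def chain_success(p: float, steps: int) -> float:
    """Probability that every step of a linear chain is correct.

    Toy model (an assumption, not taken from any cited paper):
    steps are independent, each is correct with probability p,
    and there is no mechanism to revisit a wrong step.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return p ** steps


if __name__ == "__main__":
    # Show how end-to-end success decays as the chain gets longer.
    for steps in (1, 5, 10, 20):
        print(f"p=0.95, steps={steps:2d}: {chain_success(0.95, steps):.3f}")
```

Under these assumptions, a per-step accuracy of 0.95 gives roughly 0.60 end-to-end success at 10 steps and roughly 0.36 at 20. Real reasoning chains are not independent step by step, so this is only an intuition for the "error propagation" critique, not a measurement.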
Beaten on benchmarks
Head-to-head results where a newer method reports beating Chain-of-Thought. Values are copied from the source paper's tables — verify against the cited paper.
- GOT4Rec: Graph of Thoughts for Sequential Recommendation
GOT4Rec beats Chain-of-Thought · HR@20 [Games]
0.1361 vs 0.1347
- Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
CoT+SC (n=20) beats Chain-of-Thought · Accuracy [GPT-4o]
88.93 vs 87.86
- Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
CoT+SC (n=20) beats Chain-of-Thought · Accuracy [GPT-4o-mini]
82.61 vs 81.43
- Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
CoT+SC (n=20) beats Chain-of-Thought · Accuracy [Qwen2.5-32B-Instruct]
82.90 vs 80.06
- Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency
Slim-SC (DP) beats Chain-of-Thought · Accuracy (%) [R1-Distill]
62.5 vs 58.8
- Capabilities and Fundamental Limits of Latent Chain-of-Thought
COCONUT beats Chain-of-Thought · Acc. (%) [ProntoQA]
99.8 vs 98.8
- Capabilities and Fundamental Limits of Latent Chain-of-Thought
COCONUT beats Chain-of-Thought · Acc. (%) [ProsQA]
97.0 vs 77.5
- Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProofWriter [Llama3.1-8B-Instruct]
68.67 vs 44.83
- Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProntoQA [Llama3.1-8B-Instruct]
89.00 vs 74.00
- Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · L.Deduction [Llama3.1-8B-Instruct]
59.33 vs 58.00
- Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProofWriter [Qwen3-8B]
78.67 vs 57.83
- Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProntoQA [Qwen3-8B]
97.20 vs 95.80
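The size of the win differs a great deal across these entries, and a few lines of arithmetic make that visible. The sketch below copies the value pairs listed above and computes the gap for each one. The `RESULTS` list and the `margin` and `ranked_margins` helpers are names chosen here for illustration and do not come from any cited paper. Gaps are in whatever unit the source table uses, so the HR@20 entry (a fraction) is not directly comparable to the accuracy entries (percentage points).

```python
# Each row: (newer method, metric [setting], newer score, Chain-of-Thought score).
# Values are copied from the list above; verify against the cited papers.
RESULTS = [
    ("GOT4Rec", "HR@20 [Games]", 0.1361, 0.1347),
    ("CoT+SC (n=20)", "Accuracy [GPT-4o]", 88.93, 87.86),
    ("CoT+SC (n=20)", "Accuracy [GPT-4o-mini]", 82.61, 81.43),
    ("CoT+SC (n=20)", "Accuracy [Qwen2.5-32B-Instruct]", 82.90, 80.06),
    ("Slim-SC (DP)", "Accuracy (%) [R1-Distill]", 62.5, 58.8),
    ("COCONUT", "Acc. (%) [ProntoQA]", 99.8, 98.8),
    ("COCONUT", "Acc. (%) [ProsQA]", 97.0, 77.5),
    ("Symbolic-Aided CoT", "ProofWriter [Llama3.1-8B-Instruct]", 68.67, 44.83),
    ("Symbolic-Aided CoT", "ProntoQA [Llama3.1-8B-Instruct]", 89.00, 74.00),
    ("Symbolic-Aided CoT", "L.Deduction [Llama3.1-8B-Instruct]", 59.33, 58.00),
    ("Symbolic-Aided CoT", "ProofWriter [Qwen3-8B]", 78.67, 57.83),
    ("Symbolic-Aided CoT", "ProntoQA [Qwen3-8B]", 97.20, 95.80),
]


def margin(newer: float, cot: float) -> float:
    """Absolute gap between the newer method and Chain-of-Thought."""
    # Round to 4 places to hide floating-point noise in the subtraction.
    return round(newer - cot, 4)


def ranked_margins(results=RESULTS):
    """Return (method, setting, gap) rows sorted from largest to smallest gap."""
    rows = [(method, setting, margin(newer, cot))
            for method, setting, newer, cot in results]
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for method, setting, gap in ranked_margins():
        print(f"{gap:8.4f}  {method} · {setting}")
```

On the values listed here, the largest gap is 23.84 points (Symbolic-Aided CoT on ProofWriter with Llama3.1-8B-Instruct), while five of the accuracy gaps are under 2 points. Small gaps like these are worth checking against the source paper for variance or significance reporting before treating them as a clear win.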
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 27, 2026
- Tree-of-Thoughts · Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns · May 27, 2026
- May 22, 2026
- May 22, 2026
- Novelty-based Tree-of-Thought Search · Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning · May 7, 2026
- Decoding-Time Debiasing via Process Reward Models · Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation · May 4, 2026
- Apr 27, 2026
- Apr 22, 2026
- CoT-PoT ensembling · Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning · Apr 19, 2026
- Atropos · Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap · Apr 16, 2026
- Apr 1, 2026
- Learning When to Sample · Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning · Mar 17, 2026