Is Chain-of-Thought superseded?

Chain-of-Thought (LLM reasoning / chain-of-thought) is heavily superseded: it is a standard baseline that newer methods routinely beat. 25 papers critique it and 13 beat it on benchmarks, which ranks it #1 of 772 on the most-superseded list. It leads its sub-problem cluster. Newer alternatives in the same sub-problem include Marginal Sharpening, Tree-of-Thoughts, Co-ReAct, MA-CoT, and Novelty-based Tree-of-Thought Search.

Chain-of-Thought

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

LLM reasoning / chain-of-thought · first seen Jan 28, 2022

What papers say

Verbatim critique sentences, each from a paper that cites Chain-of-Thought as a baseline.

“errors can propagate through the reasoning chain, and there is no principled mechanism to revisit decisions or explore alternative strategies.”
— STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts
“However, despite these advances, LLMs frequently exhibit limitations in their logical consistency, accuracy, and self-correction abilities when confronted with highly intricate or chaotic reasoning problems”
— Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection
“However, CoT is fundamentally linear: once a reasoning step is generated, the model commits to it, often propagating early errors into final failures.”
— Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns
“Studies have shown that simple approaches like CoT are inadequate for tasks that demand decomposition into sub-tasks”
— GOT4Rec: Graph of Thoughts for Sequential Recommendation
“[Sprague et al., 2025] reported that on the Massive Multitask Language Understanding (MMLU) benchmark [Hendrycks et al., 2021], 95% of the performance gain from CoT is attributed to questions involving symbolic reasoning.”
— Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
“Nevertheless, CoT alone does not guarantee the factual correctness of the underlying statements within the reasoning chain.”
— Enhancing Factual Accuracy and Citation Generation in LLMs via Multi-Stage Self-Verification
“Chain-of-Thought (CoT) prompting [Wei et al., 2022; Zhang et al., 2023; Lyu et al., 2023], while enhancing reasoning, may not ensure its steps visually align with the image and can be sensitive to setup or resource-intensive.”
— MRFD: Multi-Region Fusion Decoding with Self-Consistency for Mitigating Hallucinations in LVLMs
“Although CoT has led to remarkable achievements, it does not always provide positive outcomes and sometimes hinders reasoning performance”
— Unveiling and Causalizing CoT: A Causal Pespective
“it still relies on a relatively simple, linear flow of thought, which can become limiting for tasks involving more complex reasoning”
— AURORA:Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
“In this paper, we reveal a strikingly counterintuitive finding: Chain-of-Thought prompting unexpectedly degrades LLM performances in certain problem-solving contexts.”
— The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning
“this capability typically relies on explicit supervision (annotated CoT data), is confined to a discrete token space, and is ultimately capped by the base model's pre-trained capabilities”
— Pretraining with Token-Level Adaptive Latent Chain-of-Thought
“CoT's linear structure fails to capture the branching nature of mathematical reasoning, where multiple strategies are considered, partial arguments explored, and errors necessitate backtracking”
— Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs
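
Several of the critiques above make the same structural point: a linear chain commits to each step and cannot revisit it, whereas a branching search can back up and try an alternative. The sketch below illustrates that contrast on a toy arithmetic puzzle. It is an assumption-laden illustration, not the method of any cited paper: the step set, the "closest to target" heuristic, and the function names linear_chain and tree_search are all hypothetical, and no language model is involved.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy "reasoning steps": each maps a partial state to a new one.
STEPS: Dict[str, Callable[[int], int]] = {
    "double": lambda x: x * 2,
    "add3": lambda x: x + 3,
}


def linear_chain(start: int, target: int, max_steps: int) -> Optional[List[str]]:
    """Commit to the locally best-looking step each time and never revisit it."""
    state, path = start, []
    for _ in range(max_steps):
        if state == target:
            return path
        # Greedy choice: the step whose result lands closest to the target.
        name = min(STEPS, key=lambda n: abs(target - STEPS[n](state)))
        state = STEPS[name](state)
        path.append(name)
    return path if state == target else None


def tree_search(start: int, target: int, max_steps: int) -> Optional[List[str]]:
    """Depth-first search that backtracks out of dead ends."""

    def dfs(state: int, path: List[str]) -> Optional[List[str]]:
        if state == target:
            return path
        # Both toy steps only increase a positive state, so overshooting
        # the target is a dead end, as is running out of steps.
        if len(path) == max_steps or state > target:
            return None
        for name, step in STEPS.items():
            found = dfs(step(state), path + [name])
            if found is not None:
                return found
        return None  # every branch failed: back up to the caller

    return dfs(start, [])


if __name__ == "__main__":
    print("linear:", linear_chain(1, 10, max_steps=4))
    print("search:", tree_search(1, 10, max_steps=4))
```

Starting from 1 with a target of 10, the greedy chain takes add3 first, overshoots to 11 two steps later, and has no way to undo the early choice, while the search abandons the failing branches and finds a valid four-step path. The toy only demonstrates the failure mode the quotes describe; it says nothing about how often that failure occurs in practice.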

Beaten on benchmarks

Head-to-head results where a newer method reports beating Chain-of-Thought. Values are copied from the source papers' tables; verify them against the cited paper.

GOT4Rec beats Chain-of-Thought · HR@20 [Games]
0.1361 vs 0.1347
GOT4Rec: Graph of Thoughts for Sequential Recommendation
CoT+SC (n=20) beats Chain-of-Thought · Accuracy [GPT-4o]
88.93 vs 87.86
Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
CoT+SC (n=20) beats Chain-of-Thought · Accuracy [GPT-4o-mini]
82.61 vs 81.43
Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
CoT+SC (n=20) beats Chain-of-Thought · Accuracy [Qwen2.5-32B-Instruct]
82.90 vs 80.06
Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Slim-SC (DP) beats Chain-of-Thought · Accuracy (%) [R1-Distill]
62.5 vs 58.8
Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency
COCONUT beats Chain-of-Thought · Acc. (%) [ProntoQA]
99.8 vs 98.8
Capabilities and Fundamental Limits of Latent Chain-of-Thought
COCONUT beats Chain-of-Thought · Acc. (%) [ProsQA]
97.0 vs 77.5
Capabilities and Fundamental Limits of Latent Chain-of-Thought
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProofWriter [Llama3.1-8B-Instruct]
68.67 vs 44.83
Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProntoQA [Llama3.1-8B-Instruct]
89.00 vs 74.00
Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · L.Deduction [Llama3.1-8B-Instruct]
59.33 vs 58.00
Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProofWriter [Qwen3-8B]
78.67 vs 57.83
Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
Symbolic-Aided CoT (ours) beats Chain-of-Thought · ProntoQA [Qwen3-8B]
97.20 vs 95.80
Non-Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning
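
The reported margins vary widely, from about one point to more than twenty. A minimal sketch of the arithmetic, using only the values listed above, computes each absolute margin (in the metric's own units) and the relative margin over the Chain-of-Thought score. The helper names margin and ranked_by_relative_margin are hypothetical, and the method labels are shortened by dropping "(ours)".

```python
# (newer method, metric [setting], newer score, Chain-of-Thought score),
# copied from the head-to-head list above.
RESULTS = [
    ("GOT4Rec", "HR@20 [Games]", 0.1361, 0.1347),
    ("CoT+SC (n=20)", "Accuracy [GPT-4o]", 88.93, 87.86),
    ("CoT+SC (n=20)", "Accuracy [GPT-4o-mini]", 82.61, 81.43),
    ("CoT+SC (n=20)", "Accuracy [Qwen2.5-32B-Instruct]", 82.90, 80.06),
    ("Slim-SC (DP)", "Accuracy (%) [R1-Distill]", 62.5, 58.8),
    ("COCONUT", "Acc. (%) [ProntoQA]", 99.8, 98.8),
    ("COCONUT", "Acc. (%) [ProsQA]", 97.0, 77.5),
    ("Symbolic-Aided CoT", "ProofWriter [Llama3.1-8B-Instruct]", 68.67, 44.83),
    ("Symbolic-Aided CoT", "ProntoQA [Llama3.1-8B-Instruct]", 89.00, 74.00),
    ("Symbolic-Aided CoT", "L.Deduction [Llama3.1-8B-Instruct]", 59.33, 58.00),
    ("Symbolic-Aided CoT", "ProofWriter [Qwen3-8B]", 78.67, 57.83),
    ("Symbolic-Aided CoT", "ProntoQA [Qwen3-8B]", 97.20, 95.80),
]


def margin(new: float, base: float) -> tuple:
    """Return (absolute margin, relative margin in percent of the baseline)."""
    absolute = new - base
    return absolute, 100.0 * absolute / base


def ranked_by_relative_margin(results: list) -> list:
    """Rows of (method, setting, absolute, relative %), largest relative first."""
    rows = [(m, s, *margin(new, base)) for m, s, new, base in results]
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    for method, setting, absolute, relative in ranked_by_relative_margin(RESULTS):
        print(f"{method:<20} {setting:<36} {absolute:+8.4f} {relative:+6.1f}%")
```

Relative margin makes it possible to line up HR@20 (a fraction) against accuracy (a percentage), but the rows still come from different models, benchmarks, and papers, so the ranking is descriptive rather than a like-for-like comparison.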

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.