Self-Refine
Self-Refine: Iterative Refinement with Self-Feedback
LLM reasoning / chain-of-thought · first seen Mar 30, 2023
superseded — cited as a baseline and beaten by newer methods
4 papers critique it · 3 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Self-Refine as a baseline.
“Though rewriting techniques like Self-Refine (Madaan et al., 2023) can help relieve, this tendency can still lead to misleading or wrong outcomes in real-world complex mathematical problems.”
— Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
“rather than relying on post-hoc reflection or global templates”
— Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates
“In all of these methods the guidance signal---critique, scoring model, abstraction prompt---is produced by an untrained, prompted LLM.”
— Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
“Sometimes LLMs can directly provide correct answers to questions, but after applying CoT-like methods, it brings extra reasoning paths to models, causing their answers to be wrong.”
— Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning
Beaten on benchmarks
Head-to-head results where a newer method reports beating Self-Refine. Values are copied from the source paper's tables — verify against the cited paper.
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Self-Refine · DeepResearchBench Average [Qwen3-8B]
34.01 vs 33.71
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Self-Refine · SQA-CS-V2 Average [Qwen3-8B]
74.80 vs 74.18
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Self-Refine · DeepResearchBench Average [Qwen3-14B]
36.92 vs 34.24
- Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
Co-ReAct beats Self-Refine · SQA-CS-V2 Average [Qwen3-14B]
76.05 vs 74.22
- Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning
RIDERS beats Self-Refine · ACC [Llama2-13B]
65.3 vs 52.4
- A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
PASR beats Self-Refine · Avg [Qwen2.5-7B]
61.7 vs 57.5
- A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
PASR beats Self-Refine · Avg [Qwen3-8B]
69.1 vs 65.0
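The margins behind these head-to-head results vary widely, so it can help to compute them explicitly. The sketch below is illustrative only: the scores are copied from the entries above, while the `RESULTS` layout and the `margins` helper are hypothetical names introduced here, not part of any cited paper. Margins are absolute differences on each benchmark's own scale, so they are not comparable across benchmarks.

```python
# Scores copied from the "Beaten on benchmarks" entries above.
# Tuple layout (an assumption for this sketch):
# (newer_method, benchmark, model, newer_score, self_refine_score)
RESULTS = [
    ("Co-ReAct", "DeepResearchBench Average", "Qwen3-8B", 34.01, 33.71),
    ("Co-ReAct", "SQA-CS-V2 Average", "Qwen3-8B", 74.80, 74.18),
    ("Co-ReAct", "DeepResearchBench Average", "Qwen3-14B", 36.92, 34.24),
    ("Co-ReAct", "SQA-CS-V2 Average", "Qwen3-14B", 76.05, 74.22),
    ("RIDERS", "ACC", "Llama2-13B", 65.3, 52.4),
    ("PASR", "Avg", "Qwen2.5-7B", 61.7, 57.5),
    ("PASR", "Avg", "Qwen3-8B", 69.1, 65.0),
]


def margins(results):
    """Return (method, benchmark, model, margin) with margin = newer - Self-Refine."""
    return [
        (method, bench, model, round(new - base, 2))
        for method, bench, model, new, base in results
    ]


if __name__ == "__main__":
    # Print the smallest margins first so narrow wins stand out.
    for method, bench, model, delta in sorted(margins(RESULTS), key=lambda r: r[3]):
        print(f"{method:9s} {bench:27s} [{model}]  +{delta:.2f}")
```

Sorting by margin puts the narrowest reported wins first; the page's own advice to verify values against the cited papers applies to those in particular.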
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- May 27, 2026
- Tree-of-Thoughts · Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns · May 27, 2026
- May 22, 2026
- May 22, 2026
- Novelty-based Tree-of-Thought Search · Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning · May 7, 2026
- Decoding-Time Debiasing via Process Reward Models · Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation · May 4, 2026
- Apr 27, 2026
- Apr 22, 2026
- CoT-PoT ensembling · Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning · Apr 19, 2026
- Atropos · Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap · Apr 16, 2026
- Apr 1, 2026
- Learning When to Sample · Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning · Mar 17, 2026