Is Self-Refine superseded?

Self-Refine (LLM reasoning / chain-of-thought) is superseded: it is cited as a baseline and beaten by newer methods. 4 papers critique it and 3 beat it on benchmarks, which ranks it #9 of 772 most-superseded methods. It belongs to the sub-problem cluster led by Chain-of-Thought. Newer alternatives in the same sub-problem include Marginal Sharpening, Tree-of-Thoughts, Co-ReAct, MA-CoT, and Novelty-based Tree-of-Thought Search.


Superseded baseline · #9 of 772 most-superseded

Self-Refine

Self-Refine: Iterative Refinement with Self-Feedback

LLM reasoning / chain-of-thought · first seen Mar 30, 2023

superseded — cited as a baseline and beaten by newer methods

4 papers critique it · 3 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Self-Refine as a baseline.

“Though rewriting techniques like Self-Refine (Madaan et al., 2023) can help relieve, this tendency can still lead to misleading or wrong outcomes in real-world complex mathematical problems.”
— Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B
“rather than relying on post-hoc reflection or global templates”
— Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates
“In all of these methods the guidance signal (critique, scoring model, abstraction prompt) is produced by an untrained, prompted LLM.”
— Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
“Sometimes LLMs can directly provide correct answers to questions, but after applying CoT-like methods, it brings extra reasoning paths to models, causing their answers to be wrong.”
— Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning
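
For context on what these critiques target, the loop named in the method's title (iterative refinement with self-feedback) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the `llm` callable, the prompt strings, the `stop_marker` text, and the `max_iters` default are all hypothetical, and `toy_llm` is a stand-in for a real model call.

```python
from typing import Callable


def self_refine(
    task: str,
    llm: Callable[[str], str],          # hypothetical: any prompt -> text function
    max_iters: int = 3,                 # hypothetical default
    stop_marker: str = "NO FURTHER CHANGES",  # hypothetical stop signal
) -> str:
    """Generate a draft, then alternate self-feedback and revision."""
    # Initial draft from the prompted model.
    draft = llm(f"Task: {task}\nAnswer:")
    for _ in range(max_iters):
        # The same model critiques its own output; no trained critic is used.
        feedback = llm(f"Task: {task}\nAnswer: {draft}\nGive feedback:")
        if stop_marker in feedback:
            break
        # The same model revises the draft using its own feedback.
        draft = llm(
            f"Task: {task}\nAnswer: {draft}\nFeedback: {feedback}\nRevised answer:"
        )
    return draft


def toy_llm(prompt: str) -> str:
    """Stand-in for a real model call, for illustration only."""
    if prompt.endswith("Give feedback:"):
        return "NO FURTHER CHANGES" if "v2" in prompt else "Be more specific."
    if prompt.endswith("Revised answer:"):
        return "answer v2"
    return "answer v1"


print(self_refine("demo", toy_llm))
```

With the toy stand-in, the loop drafts once, revises once, and stops when the second round of feedback contains the stop marker. The point the Co-ReAct quote makes is visible in the structure: draft, feedback, and revision all come from the same untrained, prompted model.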

Beaten on benchmarks

Head-to-head results where a newer method reports beating Self-Refine. Values are copied from the source paper's tables — verify against the cited paper.

Co-ReAct beats Self-Refine · DeepResearchBench Average [Qwen3-8B]
34.01 vs 33.71
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct beats Self-Refine · SQA-CS-V2 Average [Qwen3-8B]
74.80 vs 74.18
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct beats Self-Refine · DeepResearchBench Average [Qwen3-14B]
36.92 vs 34.24
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct beats Self-Refine · SQA-CS-V2 Average [Qwen3-14B]
76.05 vs 74.22
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

RIDERS beats Self-Refine · ACC [Llama2-13B]
65.3 vs 52.4
Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning

PASR beats Self-Refine · Avg [Qwen2.5-7B]
61.7 vs 57.5
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models

PASR beats Self-Refine · Avg [Qwen3-8B]
69.1 vs 65.0
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
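
To make the size of each reported gap easier to read, the short sketch below computes the absolute margin (newer method minus Self-Refine) for every row listed above. The scores are copied from this page, not re-derived from the papers. The tuple layout and the `margins` helper are illustrative assumptions, and the first value in each "X vs Y" pair is taken to be the newer method's score.

```python
# (newer method, benchmark, model, newer score, Self-Refine score)
# Values copied from the head-to-head list on this page.
RESULTS = [
    ("Co-ReAct", "DeepResearchBench Average", "Qwen3-8B", 34.01, 33.71),
    ("Co-ReAct", "SQA-CS-V2 Average", "Qwen3-8B", 74.80, 74.18),
    ("Co-ReAct", "DeepResearchBench Average", "Qwen3-14B", 36.92, 34.24),
    ("Co-ReAct", "SQA-CS-V2 Average", "Qwen3-14B", 76.05, 74.22),
    ("RIDERS", "ACC", "Llama2-13B", 65.3, 52.4),
    ("PASR", "Avg", "Qwen2.5-7B", 61.7, 57.5),
    ("PASR", "Avg", "Qwen3-8B", 69.1, 65.0),
]


def margins(rows):
    """Return (method, benchmark, model, newer - baseline) rounded to 2 places."""
    return [
        (method, bench, model, round(new - old, 2))
        for method, bench, model, new, old in rows
    ]


for method, bench, model, delta in margins(RESULTS):
    print(f"{method:9s} {bench:26s} [{model}] +{delta:.2f}")
```

The margins range from 0.30 points (Co-ReAct on DeepResearchBench with Qwen3-8B) to 12.9 points (RIDERS on ACC with Llama2-13B). The benchmarks use different metrics and scales, so the margins are not comparable across rows.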

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.