Living systematic review

LLM reasoning / chain-of-thought

Eliciting multi-step reasoning from LLMs: chain-of-thought and tree-of-thought prompting, self-consistency, process reward models, self-refinement, and test-time compute scaling.

391 papers · 892 critique receipts · 2,230 benchmark results · updated Jun 18, 2026

Most-superseded baselines

Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.

1. Chain-of-Thought · Chain-of-Thought
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
25 papers critique it · 13 beat it on benchmarks
2. Self-Consistency · Chain-of-Thought
Self-Consistency Improves Chain of Thought Reasoning in Language Models
24 papers critique it · 9 beat it on benchmarks
3. ToT · Chain-of-Thought
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
9 papers critique it · 5 beat it on benchmarks
4. ORM · ORM
10 papers critique it · 4 beat it on benchmarks
5. Math-Shepherd · ORM
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
5 papers critique it · 8 beat it on benchmarks
6. PRM · Chain-of-Thought
7 papers critique it · 3 beat it on benchmarks
7. ReAct · ReAct
ReAct: Synergizing Reasoning and Acting in Language Models
5 papers critique it · 3 beat it on benchmarks
8. Best-of-N · Chain-of-Thought
3 papers critique it · 4 beat it on benchmarks
9. Self-Refine · Chain-of-Thought
Self-Refine: Iterative Refinement with Self-Feedback
4 papers critique it · 3 beat it on benchmarks
10. GRPO · GRPO
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
5 papers critique it · 0 beat it on benchmarks
11. H-CoT · H-CoT
H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
2 papers critique it · 2 beat it on benchmarks
12. MCTS · MCTS
2 papers critique it · 2 beat it on benchmarks
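
The ranking above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page does not say how its counts are stored or how ties are broken, so the input shape (pairs of paper id and method name), the function name `rank_baselines`, the choice to count a paper once even if it both critiques and beats a method, and the alphabetical tie-break are all hypothetical.

```python
from collections import defaultdict


def rank_baselines(critiques, wins):
    """Rank methods by how many distinct papers critique or beat them.

    critiques, wins: iterables of (paper_id, method) pairs (assumed shape).
    Returns rows of (method, n_critics, n_beaters, n_distinct_papers).
    """
    critics = defaultdict(set)
    beaters = defaultdict(set)
    for paper, method in critiques:
        critics[method].add(paper)
    for paper, method in wins:
        beaters[method].add(paper)

    rows = []
    for method in set(critics) | set(beaters):
        # Assumption: a paper that both critiques and beats counts once.
        distinct = critics[method] | beaters[method]
        rows.append((method, len(critics[method]), len(beaters[method]), len(distinct)))

    # Most-superseded first; ties broken alphabetically (an arbitrary choice).
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows


# Toy input, not taken from the review's data.
toy_critiques = [("p1", "CoT"), ("p2", "CoT"), ("p1", "SC")]
toy_wins = [("p2", "CoT"), ("p3", "CoT"), ("p4", "SC")]
ranking = rank_baselines(toy_critiques, toy_wins)
```

Using sets per method keeps the count at distinct papers, so a paper that reports several benchmark wins over the same baseline is still counted once.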

Sub-problems

Methods that compete on the same benchmarks cluster into distinct sub-problems.

Chain-of-Thought · 178 methods

Chain-of-Thought · Self-Consistency · ToT · PRM · Best-of-N · Self-Refine

ORM · 99 methods

ORM · Math-Shepherd · EurusPRM-Stage2 · VersaPRM · Monte Carlo estimation · EurusPRM-Stage1

ReAct · 33 methods

ReAct · Transformer · Block Universal Transformer · Data Interpreter · HRM · AutoGen

DualHSIC · 36 methods

DualHSIC · LiDER · OCM · AdvST · ARGUS · MVP-Shot

MCTS · 23 methods

MCTS · PRIME · Qwen2.5-7B-Instruct · external CoT monitoring · Integration (post-hoc merging) · Monte Carlo Tree Search (MCTS) / MCTSr

single-teacher CoT distillation · 18 methods

single-teacher CoT distillation · Mixture-of-Agents and LLM-Blender (runtime ensembles) · MoT (Merge of Thought) · multi-teacher CoT aggregation · parameter merging frameworks · pruning-based CoT refinement

Outcome Reward Models · 17 methods

Outcome Reward Models · MCTS-based scoring · Neural Process Reward Models · VisualPRM · Gemma3-27B (Baseline) · Qwen-2.5-VL-32B (Baseline)

Diable · 16 methods

Diable · LUNA · SPACE-3 · GNNs · RAG and KG prompting · semantic parsing

VL-Rethinker-7B · 15 methods

VL-Rethinker-7B · LLaVA-CoT · Insight-V · DPO (Direct Preference Optimization) · LLaVA-o1 · PPO (Proximal Policy Optimization)

DIN-SQL · 14 methods

DIN-SQL · DAIL-SQL GPT-4 · ROUTE Qwen2.5-7B · predict SQL-only Llama-3.1-8B-Instruct · STaR-SQL · DTS-SQL Mistral-7B
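
One simple way to obtain clusters like those listed above is to link any two methods that report results on a shared benchmark and take the connected components. The sketch below assumes exactly that; the page does not state which clustering procedure it uses, and the function name `cluster_methods`, the input shape, and the toy data are hypothetical.

```python
from collections import defaultdict


def cluster_methods(results):
    """Group methods into connected components via shared benchmarks.

    results: iterable of (method, benchmark) pairs (assumed shape).
    Returns a list of sets of method names, largest cluster first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_on_benchmark = {}
    for method, benchmark in results:
        # Link each method to the first method seen on the same benchmark.
        anchor = first_on_benchmark.setdefault(benchmark, method)
        union(anchor, method)

    groups = defaultdict(set)
    for method in parent:
        groups[find(method)].add(method)
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Toy input, not taken from the review's data.
toy_results = [
    ("CoT", "GSM8K"), ("SC", "GSM8K"),
    ("ToT", "Game24"), ("SC", "Game24"),
    ("DIN-SQL", "Spider"), ("DAIL-SQL", "Spider"),
]
clusters = cluster_methods(toy_results)
```

Connected components are transitive, so two methods can land in the same cluster without sharing a benchmark directly, which may explain why large clusters mix fairly different techniques.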

The frontier

Recent methods not yet superseded in the knowledge base.