Living systematic review
LLM reasoning / chain-of-thought
Eliciting multi-step reasoning from LLMs: chain-of-thought and tree-of-thought prompting, self-consistency, process reward models, self-refinement, and test-time compute scaling.
391 papers · 892 critique receipts · 2,230 benchmark results · updated Jun 18, 2026
Most-superseded baselines
Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.
- 1. Chain-of-Thought · Chain-of-Thought · Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
25 papers critique it · 13 beat it on benchmarks
- 2. Self-Consistency · Chain-of-Thought · Self-Consistency Improves Chain of Thought Reasoning in Language Models
24 papers critique it · 9 beat it on benchmarks
- 3. ToT · Chain-of-Thought · Tree of Thoughts: Deliberate Problem Solving with Large Language Models
9 papers critique it · 5 beat it on benchmarks
- 5. Math-Shepherd · ORM · Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
5 papers critique it · 8 beat it on benchmarks
- 6. PRM · Chain-of-Thought · PRM: Photometric Stereo based Large Reconstruction Model
7 papers critique it · 3 beat it on benchmarks
- 7. ReAct · ReAct · ReAct: Synergizing Reasoning and Acting in Language Models
5 papers critique it · 3 beat it on benchmarks
- 9. Self-Refine · Chain-of-Thought · Self-Refine: Iterative Refinement with Self-Feedback
4 papers critique it · 3 beat it on benchmarks
- 10. GRPO · GRPO · DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
5 papers critique it · 0 beat it on benchmarks
- 11. H-CoT · H-CoT · H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking
2 papers critique it · 2 beat it on benchmarks
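The ranking rule can be illustrated with a small sketch. The page does not publish the exact formula, so the score below is an assumption: it simply adds the two counts shown for each entry, which is an upper bound on the number of distinct papers (a paper that both critiques and beats a method would be counted twice). The names `Baseline`, `supersession_score`, and `rank_baselines` are hypothetical; the counts are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    name: str
    critiqued_by: int  # papers that critique the method
    beaten_by: int     # papers that beat it on benchmarks

    @property
    def supersession_score(self) -> int:
        # Assumption: sum of the two counts. The true metric counts
        # distinct papers, so this is only an upper bound.
        return self.critiqued_by + self.beaten_by


# Counts copied from the list above.
BASELINES = [
    Baseline("Chain-of-Thought", 25, 13),
    Baseline("Self-Consistency", 24, 9),
    Baseline("ToT", 9, 5),
    Baseline("Math-Shepherd", 5, 8),
    Baseline("PRM", 7, 3),
    Baseline("ReAct", 5, 3),
    Baseline("Self-Refine", 4, 3),
    Baseline("GRPO", 5, 0),
    Baseline("H-CoT", 2, 2),
]


def rank_baselines(baselines):
    """Return baselines ordered from most to least superseded."""
    return sorted(baselines, key=lambda b: b.supersession_score, reverse=True)


if __name__ == "__main__":
    for b in rank_baselines(BASELINES):
        print(f"{b.name}: {b.supersession_score}")
```

Under this assumed score, the listed entries come out in the same order as the ranking shown on the page.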
Sub-problems
Methods that compete on the same benchmarks cluster into distinct sub-problems.
Chain-of-Thought · 178 methods
Chain-of-Thought · Self-Consistency · ToT · PRM · Best-of-N · Self-Refine
ORM · 99 methods
ORM · Math-Shepherd · EurusPRM-Stage2 · VersaPRM · Monte Carlo estimation · EurusPRM-Stage1
ReAct · 33 methods
ReAct · Transformer · Block Universal Transformer · Data Interpreter · HRM · AutoGen
MCTS · 23 methods
MCTS · PRIME · Qwen2.5-7B-Instruct · external CoT monitoring · Integration (post-hoc merging) · Monte Carlo Tree Search (MCTS) / MCTSr
single-teacher CoT distillation · 18 methods
single-teacher CoT distillation · Mixture-of-Agents and LLM-Blender (runtime ensembles) · MoT (Merge of Thought) · multi-teacher CoT aggregation · parameter merging frameworks · pruning-based CoT refinement
Outcome Reward Models · 17 methods
Outcome Reward Models · MCTS-based scoring · Neural Process Reward Models · visualprm · Gemma3-27B (Baseline) · Qwen-2.5-VL-32B (Baseline)
Diable · 16 methods
Diable · LUNA · SPACE-3 · GNNs · RAG and KG prompting · semantic parsing
VL-Rethinker-7B · 15 methods
VL-Rethinker-7B · LLaVA-CoT · Insight-V · DPO (Direct Preference Optimization) · LLaVA-o1 · PPO (Proximal Policy Optimization)
DIN-SQL · 14 methods
DIN-SQL · DAIL-SQL GPT-4 · ROUTE Qwen2.5-7B · predict SQL-only Llama-3.1-8B-Instruct · STaR-SQL · DTS-SQL Mistral-7B
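One way to picture how methods that compete on the same benchmarks could fall into sub-problems is as connected components of a shared-benchmark graph. This is only a sketch under that assumption; the page does not state which clustering procedure it uses. The function name `cluster_by_shared_benchmark` and the benchmark labels (`bench_a`, `bench_b`, `bench_c`) are hypothetical placeholders, not real results.

```python
from collections import defaultdict


def cluster_by_shared_benchmark(results):
    """Group methods into clusters linked by at least one shared benchmark.

    `results` is an iterable of (method, benchmark) pairs. Two methods end up
    in the same cluster if a chain of shared benchmarks connects them.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_method_on = {}
    for method, benchmark in results:
        find(method)  # register the method even if it shares nothing
        if benchmark in first_method_on:
            union(first_method_on[benchmark], method)
        else:
            first_method_on[benchmark] = method

    clusters = defaultdict(set)
    for method in list(parent):
        clusters[find(method)].add(method)
    # Largest clusters first, members sorted for stable output.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical (method, benchmark) pairs for illustration only.
EXAMPLE_RESULTS = [
    ("Chain-of-Thought", "bench_a"),
    ("Self-Consistency", "bench_a"),
    ("ToT", "bench_b"),
    ("Self-Consistency", "bench_b"),
    ("DIN-SQL", "bench_c"),
    ("STaR-SQL", "bench_c"),
]

if __name__ == "__main__":
    for cluster in cluster_by_shared_benchmark(EXAMPLE_RESULTS):
        print(" · ".join(cluster))
```

Connected components are a coarse choice: a single shared benchmark is enough to merge two groups, which may explain why some clusters above mix fairly different methods.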
The frontier
Recent methods not yet superseded in the knowledge base.
- Jun 9, 2026
- Jun 8, 2026
- Jun 7, 2026
- STaR-Quant · STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models · Jun 3, 2026
- Jun 3, 2026
- May 29, 2026
- May 28, 2026
- May 27, 2026
- May 27, 2026
- Tree-of-Thoughts · Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns · May 27, 2026
- May 27, 2026
- From Simulation to Enaction · From Simulation to Enaction: Post-trained language models recognize and react to their own generations · May 25, 2026