Method Drift

Living systematic review

LLM reasoning / chain-of-thought

Eliciting multi-step reasoning from LLMs: chain-of-thought and tree-of-thought prompting, self-consistency, process reward models, self-refinement, and test-time compute scaling.

391 papers · 892 critique receipts · 2,230 benchmark results · updated Jun 18, 2026

Most-superseded baselines

Ranked by how many distinct papers critique or beat each method. These are the standard baselines newer work routinely measures against.

  1. Chain-of-Thought · Chain-of-Thought
    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    25 papers critique it · 13 beat it on benchmarks

  2. Self-Consistency · Chain-of-Thought
    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    24 papers critique it · 9 beat it on benchmarks

  3. ToT · Chain-of-Thought
    Tree of Thoughts: Deliberate Problem Solving with Large Language Models

    9 papers critique it · 5 beat it on benchmarks

  4. ORM · ORM

    10 papers critique it · 4 beat it on benchmarks

  5. PRM · Chain-of-Thought

    7 papers critique it · 3 beat it on benchmarks

  6. ReAct · ReAct
    ReAct: Synergizing Reasoning and Acting in Language Models

    5 papers critique it · 3 beat it on benchmarks

  7. Best-of-N · Chain-of-Thought

    3 papers critique it · 4 beat it on benchmarks

  8. Self-Refine · Chain-of-Thought
    Self-Refine: Iterative Refinement with Self-Feedback

    4 papers critique it · 3 beat it on benchmarks

  9. MCTS · MCTS

    2 papers critique it · 2 beat it on benchmarks
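
The ranking rule stated above (count the distinct papers that critique or beat each method) can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual pipeline: the `receipts` rows, the `(paper_id, method, kind)` layout, and the `rank_baselines` helper are all hypothetical, and the alphabetical tie-break is an assumption.

```python
from collections import defaultdict

# Hypothetical receipts: (paper_id, target_method, kind), where kind is
# "critique" or "beat". Toy rows for illustration, not knowledge-base data.
receipts = [
    ("p1", "Chain-of-Thought", "critique"),
    ("p1", "Chain-of-Thought", "beat"),  # same paper again: counted once
    ("p2", "Chain-of-Thought", "critique"),
    ("p2", "Self-Consistency", "beat"),
    ("p3", "Chain-of-Thought", "beat"),
    ("p3", "Self-Consistency", "critique"),
    ("p4", "ToT", "critique"),
]


def rank_baselines(rows):
    """Rank methods by how many distinct papers critique or beat them."""
    papers_by_method = defaultdict(set)
    for paper_id, method, _kind in rows:
        # A set keeps each paper once per method, whatever the receipt kind.
        papers_by_method[method].add(paper_id)
    # Sort by distinct-paper count (descending); break ties by name (assumed).
    return sorted(
        ((method, len(papers)) for method, papers in papers_by_method.items()),
        key=lambda item: (-item[1], item[0]),
    )


ranking = rank_baselines(receipts)
for position, (method, count) in enumerate(ranking, start=1):
    print(f"{position}. {method}: {count} distinct papers")
```

Because a paper that both critiques and beats a method is counted once, the ranking count can be smaller than the sum of the two per-entry figures shown in the list.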

Sub-problems

Methods that compete on the same benchmarks cluster into distinct sub-problems.
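
One plausible reading of this clustering, offered only as a sketch, is that two methods fall in the same sub-problem when a chain of shared benchmarks links them. The snippet below groups methods that way with a small union-find. The `results` pairs, the benchmark names `bench_a` to `bench_c`, and the `cluster_methods` helper are hypothetical; the page does not say how its clusters are actually computed.

```python
from collections import defaultdict

# Hypothetical (method, benchmark) pairs. Benchmark names are placeholders.
results = [
    ("Chain-of-Thought", "bench_a"),
    ("Self-Consistency", "bench_a"),
    ("Self-Consistency", "bench_b"),
    ("PRM", "bench_b"),
    ("ReAct", "bench_c"),
]


def cluster_methods(pairs):
    """Group methods linked, directly or transitively, by a shared benchmark."""
    parent = {}

    def find(x):
        # Return the representative of x's group, with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_on_benchmark = {}
    for method, benchmark in pairs:
        find(method)  # register the method even if it shares no benchmark
        if benchmark in first_on_benchmark:
            union(first_on_benchmark[benchmark], method)
        else:
            first_on_benchmark[benchmark] = method

    groups = defaultdict(set)
    for method in parent:
        groups[find(method)].add(method)
    return sorted(sorted(group) for group in groups.values())


clusters = cluster_methods(results)
for cluster in clusters:
    print(cluster)
```

In this toy data PRM and Chain-of-Thought share no benchmark directly, yet they land in the same cluster because Self-Consistency overlaps with both.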

The frontier

Recent methods not yet superseded in the knowledge base.