Is Reflexion superseded?

Reflexion (Agent / long-term memory) is marked superseded: it is cited as a baseline and beaten by newer methods. 3 papers critique it and 2 beat it on benchmarks, which ranks it #11 of 63 most-superseded. Its sub-problem is the cluster led by Reflexion itself. Newer alternatives in the same sub-problem include EMPO^2.

Superseded baseline · #11 of 63 most-superseded

Reflexion

Reflexion: Language Agents with Verbal Reinforcement Learning

Agent / long-term memory · first seen Mar 20, 2023

superseded — cited as a baseline and beaten by newer methods

3 papers critique it · 2 beat it on benchmarks

What papers say

Verbatim critique sentences, each from a paper that cites Reflexion as a baseline.

“Matrix incorporates a unique iterative self-refinement mechanism that allows agents to systematically improve their understanding of document structures and extraction patterns.”
— Memory-Augmented Agent Training for Business Document Understanding
“A Reflexion agent that accumulates thousands of verbal self-critiques is still running the same frozen model at every session; its filing cabinet grows while its capacity does not.”
— Contextual Agentic Memory is a Memo, Not True Memory
“Reflexion approximates reinforcement learning by storing self-critiques from synthetic environments, but it relies on binary correctness signals from the environment rather than pre-labeled data and restricts memory retrieval to identical tasks, limiting its ability to generalize.”
— Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
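
The critiques above describe Reflexion's mechanism in three parts: a frozen model retries a task, the environment returns a binary success signal, and verbal self-critiques are stored and retrieved only for the identical task. A minimal Python sketch of that loop, under those stated assumptions, may make the criticized limits concrete. It is not the Reflexion authors' implementation; `reflexion_loop`, `act`, `evaluate`, `reflect`, and the toy stubs are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List


def reflexion_loop(
    task_id: str,
    act: Callable[[str, List[str]], str],
    evaluate: Callable[[str], bool],
    reflect: Callable[[str, str], str],
    memory: Dict[str, List[str]],
    max_trials: int = 3,
) -> bool:
    """Retry one task, storing a verbal self-critique after each failure.

    Hypothetical sketch: the model behind `act` is never updated, so only
    the stored text changes between trials.
    """
    for _ in range(max_trials):
        # Retrieval is keyed by the exact task id (assumption taken from the
        # quoted critique: memory is restricted to identical tasks).
        critiques = memory.setdefault(task_id, [])
        attempt = act(task_id, critiques)
        # Binary correctness signal from the environment.
        if evaluate(attempt):
            return True
        critiques.append(reflect(task_id, attempt))
    return False


# Toy stubs (hypothetical): the "agent" succeeds once it has any critique.
def toy_act(task_id: str, critiques: List[str]) -> str:
    return "right" if critiques else "wrong"


def toy_evaluate(attempt: str) -> bool:
    return attempt == "right"


def toy_reflect(task_id: str, attempt: str) -> str:
    return f"Attempt '{attempt}' failed on {task_id}; try a different answer."


memory: Dict[str, List[str]] = {}
solved = reflexion_loop("task-A", toy_act, toy_evaluate, toy_reflect, memory)
```

In this toy run the second attempt succeeds because one critique is now in memory, yet nothing stored for `task-A` is available under any other task id, which is the generalization limit the third quote points at.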

Beaten on benchmarks

Head-to-head results where a newer method reports beating Reflexion. Values are copied from the source paper's tables — verify against the cited paper.

EMPO^2 beats Reflexion
Source: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
Average [ScienceWorld] · 75.9 vs 17.1
Score [WebShop] · 88.3 vs 58.1
Success Rate [WebShop] · 76.9 vs 28.8

AriGraph beats Reflexion
Source: AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
normalized score [Treasure Hunt] · 1.0 vs 0.93
normalized score [Cleaning] · 0.79 vs 0.27
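
To read the size of each gap at a glance, the head-to-head figures above can be turned into absolute margins. The short Python sketch below does only that arithmetic on the values listed on this page; the `results` layout and the `margin` helper are illustrative assumptions, and the margins are not comparable across benchmarks because the metrics use different scales.

```python
# (newer method, benchmark, metric, newer value, Reflexion value),
# copied from the head-to-head entries listed on this page.
results = [
    ("EMPO^2", "ScienceWorld", "Average", 75.9, 17.1),
    ("EMPO^2", "WebShop", "Score", 88.3, 58.1),
    ("EMPO^2", "WebShop", "Success Rate", 76.9, 28.8),
    ("AriGraph", "Treasure Hunt", "normalized score", 1.0, 0.93),
    ("AriGraph", "Cleaning", "normalized score", 0.79, 0.27),
]


def margin(newer: float, baseline: float) -> float:
    """Absolute gap between the newer method and the baseline, rounded to 2 places."""
    return round(newer - baseline, 2)


margins = {
    (method, benchmark, metric): margin(newer, baseline)
    for method, benchmark, metric, newer, baseline in results
}

for (method, benchmark, metric), gap in margins.items():
    print(f"{method} · {metric} [{benchmark}] · +{gap}")
```

The narrowest gap is on Treasure Hunt, where both methods score near the top of the normalized scale.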

Newer alternatives

Recent methods in the same sub-problem, not yet superseded in the knowledge base.

EMPO^2 · Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization · Feb 26, 2026