Reflexion
Reflexion: Language Agents with Verbal Reinforcement Learning
Agent / long-term memory · first seen Mar 20, 2023
superseded — cited as a baseline and beaten by newer methods
3 papers critique it · 2 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites Reflexion as a baseline.
“Matrix incorporates a unique iterative self-refinement mechanism that allows agents to systematically improve their understanding of document structures and extraction patterns.”
— Memory-Augmented Agent Training for Business Document Understanding
“A Reflexion agent that accumulates thousands of verbal self-critiques is still running the same frozen model at every session; its filing cabinet grows while its capacity does not.”
— Contextual Agentic Memory is a Memo, Not True Memory
“Reflexion approximates reinforcement learning by storing self-critiques from synthetic environments, but it relies on binary correctness signals from the environment rather than pre-labeled data and restricts memory retrieval to identical tasks, limiting its ability to generalize.”
— Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation
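The mechanism these critiques target can be illustrated with a minimal sketch. It is not the Reflexion authors' implementation: `reflexion_loop`, `act`, `evaluate`, and `reflect` are hypothetical names, and the toy stubs stand in for a frozen LLM and an environment. It shows only the three properties named in the quotes: a pass/fail signal, stored verbal self-critiques, and memory keyed by the identical task.

```python
from collections import defaultdict


def reflexion_loop(task_id, act, evaluate, reflect, memory, max_trials=3):
    """Retry a task, storing a verbal self-critique after each failed trial."""
    for trial in range(max_trials):
        # Retrieval is keyed by the identical task; other tasks' notes are never read.
        reflections = memory[task_id]
        # The model is frozen: only the reflections passed as context change.
        attempt = act(task_id, reflections)
        # Binary correctness signal from the environment (hypothetical evaluator).
        if evaluate(task_id, attempt):
            return attempt, trial + 1
        memory[task_id].append(reflect(task_id, attempt))
    return None, max_trials


# Toy stand-ins (assumptions for illustration, not a real model or benchmark).
def act(task_id, reflections):
    return f"guess-{len(reflections)}"


def evaluate(task_id, attempt):
    return attempt == "guess-2"


def reflect(task_id, attempt):
    return f"'{attempt}' failed on {task_id}; try something different"


memory = defaultdict(list)
result = reflexion_loop("toy-task", act, evaluate, reflect, memory)
print(result)
print(len(memory["toy-task"]), len(memory["other-task"]))
```

In this toy run the stored critiques grow for `toy-task` while `other-task` starts with an empty memory, which is the "identical tasks" restriction the third quote describes.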
Beaten on benchmarks
Head-to-head results where a newer method reports beating Reflexion. Values are copied from the source paper's tables — verify against the cited paper.
- Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
EMPO^2 beats Reflexion · Average [ScienceWorld]
75.9 vs 17.1
- Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
EMPO^2 beats Reflexion · Score [WebShop]
88.3 vs 58.1
- Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
EMPO^2 beats Reflexion · Success Rate [WebShop]
76.9 vs 28.8
- AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
AriGraph beats Reflexion · normalized score [Treasure Hunt]
1.0 vs 0.93
- AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents
AriGraph beats Reflexion · normalized score [Cleaning]
0.79 vs 0.27
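To compare these results side by side, the gaps can be computed directly from the values listed above. This sketch assumes the first number in each "X vs Y" pair belongs to the newer method and the second to Reflexion, which the "beats" wording suggests but the page does not state. The `results` list and `margins` helper are illustrative names.

```python
# (newer method, benchmark, metric, newer score, Reflexion score)
# Values copied from the list above; the ordering within each pair is an assumption.
results = [
    ("EMPO^2", "ScienceWorld", "Average", 75.9, 17.1),
    ("EMPO^2", "WebShop", "Score", 88.3, 58.1),
    ("EMPO^2", "WebShop", "Success Rate", 76.9, 28.8),
    ("AriGraph", "Treasure Hunt", "normalized score", 1.0, 0.93),
    ("AriGraph", "Cleaning", "normalized score", 0.79, 0.27),
]


def margins(rows):
    """Return the absolute gap and the ratio (newer / Reflexion) for each row."""
    return [
        (method, bench, metric, round(new - old, 2), round(new / old, 2))
        for method, bench, metric, new, old in rows
    ]


for method, bench, metric, gap, ratio in margins(results):
    print(f"{method} on {bench} ({metric}): gap {gap}, ratio {ratio}x")
```

The metrics use different scales (percent-style scores versus normalized scores), so gaps are comparable only within a benchmark, not across them.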
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- Feb 26, 2026