HippoRAG 2
From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Agent / long-term memory · first seen Feb 20, 2025
superseded — cited as a baseline and beaten by newer methods
1 paper critiques it · 5 beat it on benchmarks
What papers say
Verbatim critique sentences, each from a paper that cites HippoRAG 2 as a baseline.
“However, these methods typically define cross-session relationships as simple clusters without modeling the nature of relationships or temporal evolution.”
— Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue
Beaten on benchmarks
Head-to-head results where a newer method reports beating HippoRAG 2. Values are copied from the source paper's tables — verify against the cited paper.
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
GSW beats HippoRAG 2 · Precision [Overall Precision]
0.865 vs 0.812
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
GSW beats HippoRAG 2 · Recall [Overall Recall]
0.894 vs 0.787
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
GSW beats HippoRAG 2 · F1 [Overall F1]
0.850 vs 0.753
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · F1 [Complex-TR]
83.3 vs 78.2
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · BLEU-1 [Complex-TR]
77.6 vs 72.7
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · EM [Test of Time]
93.1 vs 66.9
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · F1 [LoCoMo]
42.4 vs 39.0
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · BLEU-1 [LoCoMo]
32.7 vs 30.8
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · F1 [REALTALK]
25.6 vs 21.9
- REMem: Reasoning with Episodic Memory in Language Agent
REMem-I beats HippoRAG 2 · BLEU-1 [REALTALK]
18.1 vs 16.2
- From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents
MemGAS beats HippoRAG 2 · GPT4o-as-Judge [LongMemEval-s]
60.20 vs 57.60
- Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue
PREMem beats HippoRAG 2 · LLM-as-a-judge score [Qwen2.5-14B on LongMemEval]
64.73 vs 44.69
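The results above mix 0-1 and 0-100 scales, so raw differences are not directly comparable across entries. A minimal sketch, assuming relative gain over the HippoRAG 2 score is an acceptable way to put them on one footing; the `RESULTS` table and the `relative_gain` helper are illustrative names, and the numbers are copied from the entries listed above:

```python
# (method, metric, benchmark, newer method's score, HippoRAG 2 score)
# Values are transcribed from the listing above; verify against the cited papers.
RESULTS = [
    ("GSW", "Precision", "Overall Precision", 0.865, 0.812),
    ("GSW", "Recall", "Overall Recall", 0.894, 0.787),
    ("GSW", "F1", "Overall F1", 0.850, 0.753),
    ("REMem-I", "F1", "Complex-TR", 83.3, 78.2),
    ("REMem-I", "BLEU-1", "Complex-TR", 77.6, 72.7),
    ("REMem-I", "EM", "Test of Time", 93.1, 66.9),
    ("REMem-I", "F1", "LoCoMo", 42.4, 39.0),
    ("REMem-I", "BLEU-1", "LoCoMo", 32.7, 30.8),
    ("REMem-I", "F1", "REALTALK", 25.6, 21.9),
    ("REMem-I", "BLEU-1", "REALTALK", 18.1, 16.2),
    ("MemGAS", "GPT4o-as-Judge", "LongMemEval-s", 60.20, 57.60),
    ("PREMem", "LLM-as-a-judge score", "Qwen2.5-14B on LongMemEval", 64.73, 44.69),
]


def relative_gain(new: float, baseline: float) -> float:
    """Percentage improvement of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Sort from largest to smallest relative gain over HippoRAG 2.
rows = sorted(RESULTS, key=lambda r: relative_gain(r[3], r[4]), reverse=True)

for method, metric, bench, new, base in rows:
    gain = relative_gain(new, base)
    print(f"{method:8s} {metric:22s} {bench:28s} {gain:+6.1f}%")
```

Relative gain removes the scale difference but not the differences in benchmark, judge model, or evaluation setup, so the ordering is only a rough guide to where the margins are widest.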
Newer alternatives
Recent methods in the same sub-problem, not yet superseded in the knowledge base.
- HingeMem · HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues · Apr 8, 2026
- Jan 13, 2026
- Nov 25, 2025
- Generative Semantic Workspace (GSW) · Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces · Nov 10, 2025
- Oct 7, 2025
- PREMem · Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue · Sep 13, 2025