CL AI LG · Jan 21, 2025

Episodic Memories Generation and Evaluation Benchmark for Large Language Models

Alexis Huet, Zied Ben Houidi, Dario Rossi

arXiv:2501.13121v117.613 citations · h-index: 13 · Has Code · ICLR

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of improving AI cognition and reducing confabulations, and is aimed at AI researchers and developers. Its contribution is incremental: it provides an evaluation benchmark rather than a new solution.

The paper tackles the lack of episodic memory in Large Language Models by introducing a framework and benchmark for modeling and evaluating these capabilities, finding that even advanced models like GPT-4 struggle with tasks involving multiple events or complex spatio-temporal relationships in short contexts.

Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLMs is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, Llama 3.1, and o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships -- even in contexts as short as 10k-100k tokens.
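
To make the "structured approach to represent episodic events" more concrete, here is a minimal Python sketch of what such a record and a cue-based recall check might look like. It is not the authors' released code: the class name `EpisodicEvent`, its field names, the helpers `recall_events` and `f1_score`, and the sample events are all hypothetical, and scoring with a set-overlap F1 is an assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodicEvent:
    """Hypothetical event record; field names are assumptions, not the paper's schema."""
    date: str       # temporal context
    location: str   # spatial context
    entity: str     # involved entity
    content: str    # detailed description


# Invented sample events, used only to exercise the sketch.
EVENTS = [
    EpisodicEvent("2026-03-01", "Central Park", "Ava", "attended a chess tournament"),
    EpisodicEvent("2026-03-01", "Louvre", "Ben", "sketched a statue"),
    EpisodicEvent("2026-04-12", "Central Park", "Ben", "ran a charity race"),
]


def recall_events(events, **cue):
    """Return the events whose fields match every (field, value) pair in the cue."""
    return [e for e in events if all(getattr(e, k) == v for k, v in cue.items())]


def f1_score(predicted, expected):
    """Set-overlap F1 between a model's answer items and the ground-truth items."""
    predicted, expected = set(predicted), set(expected)
    if not predicted and not expected:
        return 1.0  # correctly answering "nothing matches"
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    rec = true_positives / len(expected)
    return 2 * precision * rec / (precision + rec)


# Ground truth for the cue "Where was Ben?" spans two related events.
truth = {e.location for e in recall_events(EVENTS, entity="Ben")}
# A model that recalls only one of the two locations gets partial credit.
score = f1_score({"Louvre"}, truth)
```

In this sketch, a cue that matches several events (here, `entity="Ben"`) yields a multi-item ground truth, which is the kind of multi-event question the abstract reports as hardest for current models.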

