MemFail: Stress-Testing Failure Modes of LLM Memory Systems
For researchers and developers of LLM agents, MemFail provides a systematic way to diagnose memory system failures, moving beyond black-box accuracy benchmarks.
MemFail introduces a diagnostic benchmark to isolate and test specific failure modes of LLM memory systems, evaluating four state-of-the-art systems across five adversarially designed datasets. The results reveal tradeoffs between memory system architectures, and the benchmark's design allows errors to be attributed to a specific operation: summarization, storage, or retrieval.
Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has examined the specific failure modes and design choices of these systems. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. We then evaluate four state-of-the-art memory systems on these datasets and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.
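To make the three-operation view concrete, the following is a minimal illustrative sketch and not MemFail's actual implementation. It assumes a memory system can be modeled as three plain functions, and that a failure can be attributed to the first operation after which a target fact is no longer present. All names here (MemorySystem, attribute_failure, the toy operations, and the example conversation) are hypothetical, and substring matching stands in for whatever checks a real benchmark would use.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signatures for the three operations named in the abstract.
Summarize = Callable[[str], str]                # raw turn -> condensed entry
Store = Callable[[List[str], str], List[str]]   # (memory, entry) -> new memory
Retrieve = Callable[[List[str], str], List[str]]  # (memory, query) -> entries


@dataclass
class MemorySystem:
    """A memory system modeled as the composition of three operations."""
    summarize: Summarize
    store: Store
    retrieve: Retrieve


def attribute_failure(system: MemorySystem, turns: List[str],
                      query: str, fact: str) -> str:
    """Return the first operation after which `fact` is missing, else 'none'.

    Assumption: presence of the fact is checked by simple substring match.
    """
    memory: List[str] = []
    survived_summary = False
    for turn in turns:
        summary = system.summarize(turn)
        survived_summary = survived_summary or (fact in summary)
        memory = system.store(memory, summary)
    if not survived_summary:
        return "summarization"
    if not any(fact in entry for entry in memory):
        return "storage"
    retrieved = system.retrieve(memory, query)
    if not any(fact in entry for entry in retrieved):
        return "retrieval"
    return "none"


# Toy operations, each deliberately simple or deliberately lossy.
def keep_all(turn: str) -> str:
    return turn

def truncate(turn: str) -> str:
    return turn[:12]  # lossy summarization: keeps only a short prefix

def append(memory: List[str], entry: str) -> List[str]:
    return memory + [entry]

def keep_latest_only(memory: List[str], entry: str) -> List[str]:
    return [entry]  # lossy storage: capacity of one entry

def return_all(memory: List[str], query: str) -> List[str]:
    return list(memory)

def return_latest(memory: List[str], query: str) -> List[str]:
    return memory[-1:]  # lossy retrieval: ignores the query


TURNS = ["The user's cat is named Miso.", "The user asked about the weather."]
QUERY = "What is the cat's name?"
FACT = "Miso"

SYSTEMS = {
    "baseline": MemorySystem(keep_all, append, return_all),
    "lossy_summary": MemorySystem(truncate, append, return_all),
    "bounded_store": MemorySystem(keep_all, keep_latest_only, return_all),
    "recency_retrieval": MemorySystem(keep_all, append, return_latest),
}

if __name__ == "__main__":
    for name, system in SYSTEMS.items():
        print(name, "->", attribute_failure(system, TURNS, QUERY, FACT))
```

Each toy system differs from the baseline in exactly one operation, so the reported stage matches the component that was made lossy. An aggregate accuracy score would mark all three lossy systems as equally wrong, which is the black-box limitation the abstract describes.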