Mar 23

MemGround: Long-Term Memory Evaluation Kit for Large Language Models in Gamified Scenarios

arXiv:2604.14158
AI Analysis

For researchers evaluating LLM memory, this provides a more realistic and comprehensive benchmark, though it is an incremental improvement over existing static evaluations.

Current LLM long-term memory evaluations are static and neglect dynamic tracking and hierarchical reasoning. MemGround introduces a gamified benchmark with a three-tier framework and multi-dimensional metrics, revealing that SOTA LLMs struggle with sustained dynamic tracking and complex reasoning in interactive environments.

Current evaluations of long-term memory in LLMs are fundamentally static. By fixating on simple retrieval and short-context inference, they neglect the multifaceted nature of complex memory systems, such as dynamic state tracking and hierarchical reasoning in continuous interactions. To overcome these limitations, we propose MemGround, a rigorous long-term memory benchmark natively grounded in rich, gamified interactive scenarios. To systematically assess these capabilities, MemGround introduces a three-tier hierarchical framework that evaluates Surface State Memory, Temporal Associative Memory, and Reasoning-Based Memory through specialized interactive tasks. Furthermore, to comprehensively quantify both memory utilization and behavioral trajectories, we propose a multi-dimensional metric suite comprising Question-Answer Score (QA Overall), Memory Fragments Unlocked (MFU), Memory Fragments with Correct Order (MFCO), and Exploration Trajectory Diagrams (ETD). Extensive experiments reveal that state-of-the-art LLMs and memory agents still struggle with sustained dynamic tracking, temporal event association, and complex reasoning derived from long-term accumulated evidence in interactive environments.

