CL | Mar 26

MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, Philip S. Yu

arXiv:2603.25973 | 32.96 citations | h-index: 13

AI Analysis

MemoryCD provides a testbed for evaluating cross-domain lifelong personalization, addressing a benchmarking gap for LLM agents, though the contribution is incremental because it builds on existing datasets and methods.

The paper tackles the lack of benchmarks for evaluating long-context memory in LLM agents by introducing MemoryCD, a large-scale, user-centric benchmark derived from real-world Amazon Review data, which reveals that existing memory methods fall short of user satisfaction across domains.

Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline of 14 state-of-the-art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent's ability to simulate real user behaviors in both single-domain and cross-domain settings. Our analysis reveals that existing memory methods are far from achieving user satisfaction across domains, and MemoryCD offers the first testbed for cross-domain lifelong personalization evaluation.
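To give a sense of the scale implied by the counts in the abstract, the following is a minimal sketch that enumerates the evaluation grid. It assumes the 14 models, 6 memory methods, 4 tasks, and 12 domains are fully crossed, and that cross-domain runs use ordered source-to-target domain pairs; neither assumption is confirmed by the abstract, and all names are hypothetical placeholders.

```python
from itertools import permutations, product

# Counts taken from the abstract; everything else here is an assumption.
N_MODELS, N_METHODS, N_TASKS, N_DOMAINS = 14, 6, 4, 12

# Hypothetical placeholder names; the abstract does not list the real ones.
models = [f"model_{i}" for i in range(N_MODELS)]
methods = [f"memory_method_{i}" for i in range(N_METHODS)]
tasks = [f"task_{i}" for i in range(N_TASKS)]
domains = [f"domain_{i}" for i in range(N_DOMAINS)]

# Single-domain setting: one cell per (model, method, task, domain),
# assuming every combination is evaluated.
single_domain_grid = list(product(models, methods, tasks, domains))

# Cross-domain setting: ordered (source, target) pairs of distinct domains.
# Treating the pairs as ordered is an assumption made for illustration.
cross_domain_pairs = list(permutations(domains, 2))

print(len(single_domain_grid))
print(len(cross_domain_pairs))
```

Under these assumptions the single-domain grid has 14 × 6 × 4 × 12 = 4032 cells and there are 12 × 11 = 132 ordered domain pairs; the actual number of runs reported in the paper may differ.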
