TeleMem: Building Long-Term and Multimodal Memory for Agentic AI
This work addresses the challenge of building efficient and accurate long-term memory for AI agents, particularly in multimodal contexts like gaming, though it appears incremental as it builds on existing RAG and memory frameworks.
The paper tackles limited long-term interaction and weak multimodal reasoning in AI agents by proposing TeleMem, a unified memory system that, on the ZH-4O long-term role-play gaming benchmark, achieves 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup compared to the state-of-the-art Mem0 baseline.
Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
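To make the batch, retrieve, cluster, and consolidate steps of the writing pipeline concrete, the following is a minimal Python sketch under stated assumptions. It is not TeleMem's implementation: the `MemoryStore` class, its `batch_size` and `threshold` parameters, the word-overlap (Jaccard) similarity, and the "keep the newest entry in each cluster" consolidation rule are all hypothetical stand-ins chosen so the example runs with the standard library alone. A real system would presumably use embedding-based retrieval and an LLM to merge clustered entries.

```python
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a toy stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class MemoryStore:
    """Hypothetical memory store illustrating a batched write pipeline."""
    batch_size: int = 4
    threshold: float = 0.5
    entries: list = field(default_factory=list)
    _pending: list = field(default_factory=list, init=False, repr=False)

    def add(self, fact: str) -> None:
        # Step 1 (batch): buffer writes instead of updating on every turn.
        self._pending.append(fact)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        batch, self._pending = self._pending, []

        # Step 2 (retrieve): pull stored entries similar to any new fact.
        related = [
            e for e in self.entries
            if any(jaccard(e, b) >= self.threshold for b in batch)
        ]
        untouched = [e for e in self.entries if e not in related]

        # Step 3 (cluster): greedy single pass, comparing each item with
        # the first member of every existing cluster.
        clusters = []
        for item in related + batch:
            for cluster in clusters:
                if jaccard(cluster[0], item) >= self.threshold:
                    cluster.append(item)
                    break
            else:
                clusters.append([item])

        # Step 4 (consolidate): keep the newest item per cluster.
        # Assumption: an LLM-based merge would replace this rule in practice.
        self.entries = untouched + [cluster[-1] for cluster in clusters]


store = MemoryStore(batch_size=2, threshold=0.5)
for fact in [
    "user likes green tea",
    "user lives in Paris",
    "user likes green tea daily",
    "user owns a cat",
]:
    store.add(fact)
print(store.entries)
```

In this sketch the second batch retrieves the earlier tea-related entry, clusters it with its updated version, and stores a single consolidated entry rather than two near-duplicates, which is the kind of redundancy reduction the pipeline description points to.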