CL | CV | Jan 23

EMemBench: Interactive Benchmarking of Episodic Memory for VLM Agents

arXiv:2601.16690v1 | 4 citations | h-index: 32
Originality: Incremental advance
AI Analysis

This work addresses the challenge of benchmarking episodic memory for VLM agents. The advance is incremental, as it builds on existing memory evaluation methods.

The authors evaluate long-term episodic memory in VLM agents by introducing EMemBench, an interactive benchmark that generates questions from each agent's own trajectory in text and visual games. They find that induction and spatial reasoning are persistent bottlenecks, and that gains from persistent memory are inconsistent for VLM agents.

We introduce EMemBench, a programmatic benchmark for evaluating long-term memory of agents through interactive games. Rather than using a fixed set of questions, EMemBench generates questions from each agent's own trajectory, covering both text and visual game environments. Each template computes verifiable ground truth from underlying game signals, with controlled answerability and balanced coverage over memory skills: single/multi-hop recall, induction, temporal, spatial, logical, and adversarial. We evaluate memory agents with strong LMs/VLMs as backbones, using in-context prompting as baselines. Across 15 text games and multiple visual seeds, results are far from saturated: induction and spatial reasoning are persistent bottlenecks, especially in the visual setting. Persistent memory yields clear gains for open backbones on text games, but improvements are less consistent for VLM agents, suggesting that visually grounded episodic memory remains an open challenge. A human study further confirms the difficulty of EMemBench.
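
To make the idea of trajectory-derived questions concrete, here is a minimal sketch of how a template might compute ground truth from a logged trajectory. It is an illustration only, not the paper's implementation: the `Step` and `QA` types, the two template functions, and the toy trajectory are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One logged step of an agent's trajectory (hypothetical schema)."""
    t: int
    location: str
    action: str
    item: Optional[str] = None  # item picked up at this step, if any


@dataclass(frozen=True)
class QA:
    skill: str
    question: str
    answer: str
    answerable: bool


def single_hop_location(trajectory: List[Step], t: int) -> QA:
    """Single-hop recall template: where was the agent at step t?"""
    question = f"Where were you at step {t}?"
    for step in trajectory:
        if step.t == t:
            return QA("single-hop", question, step.location, True)
    # Step index is outside the log, so no ground truth exists.
    return QA("adversarial", question, "unanswerable", False)


def temporal_first_pickup(trajectory: List[Step], item: str) -> QA:
    """Temporal template: at which step was `item` first picked up?"""
    question = f"At which step did you first pick up the {item}?"
    for step in trajectory:
        if step.item == item:
            return QA("temporal", question, str(step.t), True)
    # The item never appears in the log. Labeling this case "adversarial"
    # is an assumption of this sketch, not a detail taken from the paper.
    return QA("adversarial", question, "unanswerable", False)


# Toy trajectory, invented for illustration.
steps = [
    Step(0, "kitchen", "look"),
    Step(1, "kitchen", "take lamp", item="lamp"),
    Step(2, "hallway", "go north"),
    Step(3, "cellar", "take key", item="key"),
]

print(single_hop_location(steps, 2))
print(temporal_first_pickup(steps, "key"))
print(temporal_first_pickup(steps, "sword"))
```

Because each answer is read directly from the log, the ground truth is verifiable by construction, and asking about an item the agent never touched gives a simple way to control answerability.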

