AI · Jun 4

When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents

arXiv:2606.06055 · 73.2
Predicted impact: top 62% in AI · last 90 days
Originality: Incremental advance
AI Analysis

For developers of memory-augmented conversational agents, this work identifies a critical gap in memory evaluation—the unwarranted integration of sensitive memories—and provides a measurement framework to assess it.

The paper introduces RBI-Eval to measure when retrieved sensitive memories are unnecessarily integrated into conversational agent responses. It finds that memory access causes substantial behavioral divergence from matched no-memory baselines (the separation score drops by 8.9%–26.6% for GPT-5.4-mini and by 51.1%–82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B), and that retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator.

Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9%–26.6% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1%–82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.
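
The percentage ranges in the abstract are decreases relative to a matched no-memory reference. The following is a minimal sketch of that arithmetic only, not the paper's implementation: the abstract does not define how the separation score itself is computed, so the function name `relative_decrease`, the setting names, and every numeric value below are hypothetical placeholders chosen for illustration.

```python
def relative_decrease(reference: float, with_memory: float) -> float:
    """Percent decrease of `with_memory` relative to `reference`.

    Both arguments are assumed to be separation scores measured on the
    same probe prompts; how such a score is computed is not specified here.
    """
    if reference == 0:
        raise ValueError("reference score must be non-zero")
    return 100.0 * (reference - with_memory) / reference


# Made-up scores for illustration only (NOT results from the paper).
reference_score = 0.80            # hypothetical no-memory reference
memory_scores = {
    "full_context": 0.20,         # hypothetical: all memory placed in context
    "retriever_a": 0.36,          # hypothetical: one retrieval system
}

decreases = {
    setting: round(relative_decrease(reference_score, score), 1)
    for setting, score in memory_scores.items()
}

for setting, pct in decreases.items():
    print(f"{setting}: {pct}% decrease vs. no-memory reference")
```

Read this way, a larger percentage means the with-memory behavior has moved further from the no-memory reference on identical prompts, which is how the 8.9%–26.6% and 51.1%–82.9% ranges are compared across models.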
