CL | AI | May 31

Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue

arXiv:2606.01223 | 79.8
AI Analysis

For researchers in long-context modeling and dialogue systems, this work addresses the lack of benchmarks for reflective memory, which is crucial for tasks requiring inference beyond factual recall.

The paper introduces RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue with 26K QA instances, and proposes REMIND, a hierarchical framework that improves answer accuracy and memory recall by synthesizing fragmented cues into high-level interpretations.

Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.
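The abstract reports two outcome measures, answer accuracy and memory recall, without defining them here. The following is a minimal sketch of one plausible way to compute such scores, not the paper's evaluation protocol. The `QAResult` schema, the exact-match criterion, and the turn-level definition of recall are all assumptions introduced for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QAResult:
    """One evaluated QA instance (hypothetical schema, not RefMem-Bench's own)."""
    predicted_answer: str
    gold_answer: str
    retrieved_turns: set[int]  # dialogue turn ids the model surfaced as evidence
    gold_turns: set[int]       # turn ids assumed to be annotated as supporting evidence


def answer_accuracy(results: list[QAResult]) -> float:
    """Fraction of instances whose answer matches gold after trimming and lowercasing."""
    if not results:
        return 0.0
    hits = sum(
        r.predicted_answer.strip().lower() == r.gold_answer.strip().lower()
        for r in results
    )
    return hits / len(results)


def memory_recall(results: list[QAResult]) -> float:
    """Mean per-instance share of gold evidence turns that were retrieved.

    Assumption: recall is measured over dialogue turns; instances without
    gold evidence are skipped rather than counted as zero.
    """
    scored = [r for r in results if r.gold_turns]
    if not scored:
        return 0.0
    per_instance = [
        len(r.retrieved_turns & r.gold_turns) / len(r.gold_turns) for r in scored
    ]
    return sum(per_instance) / len(per_instance)
```

Keeping the two measures separate makes it possible to distinguish a model that answers correctly without surfacing the supporting turns from one that retrieves the right evidence but draws the wrong interpretation from it.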

