IR · May 22

Same Ranking, Different Winner: How Scoring Targets Shape LLM Memory Benchmarks

arXiv:2605.24060 · 65.2

Predicted impact: top 46% in IR · last 90 days

Originality: Incremental advance

AI Analysis

For researchers evaluating conversational memory systems, this work reveals that an implicit design choice (scoring target) can silently reverse conclusions about which memory architecture is best.

The authors show that the scoring target, that is, the stored memory form that receives retrieval credit, can change benchmark conclusions: switching it alone changes nDCG on 83.4–94.0% of shared queries and flips target orderings on Mem0 and MemoryOS transfer runs. They propose TIAP, a fixed-output audit method, and find that relaxed source-linked credit is fully justified only 29.2% of the time.

Conversational-memory systems increasingly transform dialogue history into facts, summaries, timelines, and other source-linked descendants, so a single source turn can coexist with several derived memories in the same retrieval index. This raises an underspecified evaluation question: which stored form should receive retrieval credit? We show that this scoring-target choice is often left implicit and can materially change benchmark conclusions. We present TIAP, a fixed-output audit that rescores saved ranked outputs under three targets (Raw, Source, and Canonical) without rerunning retrieval. On LoCoMo and LongMemEval-S, switching only the credited target changes nDCG on 83.4–94.0 percent of shared queries, flips target orderings on Mem0 and MemoryOS transfer runs, and reverses parser-density recommendations. A 1,902-case semantic audit further shows that relaxed source-linked credit is fully justified only 29.2 percent of the time, despite high rubric reliability in a validation subset. These results reveal target noninvariance: conclusions about memory architectures can silently flip with a single benchmark-design choice. Conversational-memory papers should therefore define and report the scoring target explicitly.
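To make the idea of rescoring a fixed ranked output concrete, here is a minimal Python sketch. It is not the authors' TIAP implementation. The abstract names the three targets but does not define them, so the definitions below are assumptions made for illustration: Raw credits only the original source turn, Source credits any stored item linked to the gold source turn, and Canonical credits only one designated derived form (here, hypothetically, "fact"). The `MemoryItem` class and all function names are likewise hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    item_id: str
    source_turn: str  # id of the dialogue turn this item derives from
    form: str         # e.g. "raw", "fact", "summary" (hypothetical labels)


def credited(item, gold_turn, target, canonical_form="fact"):
    """Decide whether an item earns retrieval credit under a scoring target.

    The three target definitions are assumptions for illustration only.
    """
    linked = item.source_turn == gold_turn
    if target == "raw":
        return linked and item.form == "raw"
    if target == "source":
        return linked
    if target == "canonical":
        return linked and item.form == canonical_form
    raise ValueError(f"unknown target: {target}")


def ndcg_at_k(ranked, pool, gold_turn, target, k=10):
    """Binary-gain nDCG@k for a saved ranking under one scoring target."""
    gains = [1.0 if credited(it, gold_turn, target) else 0.0 for it in ranked[:k]]
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    n_relevant = sum(credited(it, gold_turn, target) for it in pool)
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(n_relevant, k)))
    return dcg / ideal if ideal > 0 else 0.0


def rescore(ranked, pool, gold_turn, targets=("raw", "source", "canonical")):
    """Rescore the same ranked list under each target; retrieval is not rerun."""
    return {t: ndcg_at_k(ranked, pool, gold_turn, t) for t in targets}


# Toy index: one gold turn (t1) stored in three forms, plus an unrelated turn.
raw_t1 = MemoryItem("m1", "t1", "raw")
fact_t1 = MemoryItem("m2", "t1", "fact")
summary_t1 = MemoryItem("m3", "t1", "summary")
raw_t2 = MemoryItem("m4", "t2", "raw")

pool = [raw_t1, fact_t1, summary_t1, raw_t2]
ranked = [fact_t1, raw_t2, raw_t1, summary_t1]  # one saved retrieval output

scores = rescore(ranked, pool, gold_turn="t1")
```

In this toy case the same ranked list scores differently under each assumed target, because the derived fact is ranked first while the raw turn sits at rank three. That is the kind of target dependence the abstract describes, shown here on made-up data rather than on LoCoMo or LongMemEval-S.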
