Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

arXiv:2605.29630 (Has Code)

AI Analysis

For researchers evaluating agent-memory retrievers, this protocol provides a rigorous method to attribute performance gains to embedders rather than lexical artifacts, revealing nuanced trade-offs between encoder capacity and query type.

The paper proposes entity-collision, a protocol that isolates retrieval lift from lexical leakage by ensuring distractors share entity tokens with the answer, and stratifies queries by discriminator tag. Applied to agent-memory benchmarks, it reveals that encoder capacity alone is not the binding constraint, with MiniLM-384 outperforming a larger BGE-large on lexical queries.

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
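The abstract's per-tag attribution with paired-bootstrap 95% CIs can be illustrated with a minimal sketch. It assumes each query yields a 0/1 hit for a BM25 baseline and for an embedder-backed retriever, and that queries are resampled as pairs within each tag. The function names, the record layout, and the percentile-interval choice are illustrative assumptions, not the paper's code; the authors' exact bootstrap procedure is not given in the abstract.

```python
import random
from collections import defaultdict


def paired_bootstrap_ci(base_hits, model_hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(model_hits) - mean(base_hits).

    Queries are resampled as pairs, so each resample keeps the baseline and
    embedder outcome of the same query together.
    """
    if len(base_hits) != len(model_hits) or not base_hits:
        raise ValueError("need two equal-length, non-empty hit lists")
    rng = random.Random(seed)
    n = len(base_hits)
    # Per-query lift: +1 embedder-only hit, -1 baseline-only hit, 0 tie.
    diffs = [m - b for b, m in zip(base_hits, model_hits)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


def lift_by_tag(records, **kwargs):
    """records: iterable of (tag, bm25_hit, embedder_hit), hits in {0, 1}.

    Returns {tag: (point_estimate, ci_low, ci_high)}, one stratum per tag,
    so tags are never averaged together.
    """
    by_tag = defaultdict(lambda: ([], []))
    for tag, base_hit, model_hit in records:
        by_tag[tag][0].append(base_hit)
        by_tag[tag][1].append(model_hit)
    return {tag: paired_bootstrap_ci(b, m, **kwargs) for tag, (b, m) in by_tag.items()}


if __name__ == "__main__":
    # Hypothetical toy records; the tag names are placeholders, not the paper's data.
    toy = [("preference", 0, 1), ("preference", 0, 0), ("preference", 1, 1),
           ("tool", 1, 1), ("tool", 1, 0), ("tool", 0, 1)]
    for tag, (lift, lo, hi) in lift_by_tag(toy, n_resamples=500).items():
        print(f"{tag}: lift={lift:+.3f}  95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

Under this reading, a tag whose interval excludes zero shows lift attributable to the embedder, provided the distractors were built to share the answer's entity tokens so that the BM25 floor is fixed by construction.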
