CL · Feb 5, 2025

Minerva: A Programmable Memory Test Benchmark for Language Models

arXiv:2502.03358v2 · 10 citations · h-index: 8 · ICML
AI Analysis

The benchmark gives researchers and developers a more detailed and actionable way to pinpoint the specific memory-related capabilities that models lack, though the contribution is incremental because it builds on existing evaluation methods.

The paper tackles the problem of evaluating how effectively language models use their memory by introducing a framework for automatically generating comprehensive tests, which extends beyond existing benchmarks to include atomic and composite tasks for interpretable assessment.

How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, performing basic operations when inputs are structured into distinct blocks, and maintaining state while operating on memory, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to perform more complex, integrated tasks. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.
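To make the idea of a programmable, automatically generated test concrete, here is a minimal sketch of one atomic task: a key-value search. It is an illustration only, not the paper's generator. The function names (`random_token`, `make_kv_search_test`, `score`), the prompt wording, and the exact-match scoring rule are all assumptions introduced for this example.

```python
import random
import string


def random_token(rng, length=8):
    """Return a random lowercase string drawn from the given RNG."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_kv_search_test(num_pairs, seed):
    """Build one key-value search test: a prompt plus the expected answer.

    Hypothetical helper for illustration; the paper's generator may differ.
    """
    rng = random.Random(seed)

    # Fill the "memory" with random key-value pairs (keys are unique).
    pairs = {}
    while len(pairs) < num_pairs:
        pairs[random_token(rng)] = random_token(rng)

    # Pick the key the model must look up.
    target_key = rng.choice(sorted(pairs))

    # Assumed prompt format: one "key: value" line per pair, then a question.
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nWhat value is stored under the key '{target_key}'?"
    return {"prompt": prompt, "expected": pairs[target_key], "key": target_key}


def score(response, expected):
    """Exact-match scoring after trimming surrounding whitespace (an assumption)."""
    return response.strip() == expected
```

Because each test is a function of `num_pairs` and `seed`, new instances can be regenerated at any context size rather than reused from a fixed file, which is one plausible way to address the static and overfitting concerns raised above. A failure on this task also points to a single capability (search) rather than to an opaque aggregate score.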

