BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity
The work gives AI and NLP practitioners a tool to quantify and mitigate the misalignment between benchmark content and real-world use cases, though the contribution is incremental because it builds on existing retrieval methods.
The paper addresses the problem that language model benchmarks may not measure what practitioners intend. It introduces BenchBrowser, a retriever that surfaces relevant evaluation items from 20 benchmark suites; a human study confirms high retrieval precision, and the retrieved items help diagnose validity gaps.
Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases across 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.
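To make "lack of stable rankings" concrete, the following is a minimal sketch of one way to check whether two benchmarks that target the same capability order models consistently. It uses Kendall's tau, which is an assumption made here for illustration; the abstract does not state which statistic BenchBrowser relies on. The function name `kendall_tau`, the model names, and all scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {model: score} dicts.

    Only models present in both dicts are compared. Tied pairs count
    as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")

    concordant = 0
    discordant = 0
    for m1, m2 in combinations(models, 2):
        # Sign of the product tells us whether both benchmarks
        # order this pair of models the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores on two item sets retrieved for the same use case.
bench_a = {"model_x": 0.81, "model_y": 0.74, "model_z": 0.62}
bench_b = {"model_x": 0.55, "model_y": 0.68, "model_z": 0.40}

tau = kendall_tau(bench_a, bench_b)
print(f"Kendall tau between rankings: {tau:.3f}")
```

A value near 1 means the two item sets rank models the same way; values near 0 or below suggest that the rankings disagree, which is the kind of evidence that would point to low convergent validity.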