CL | Apr 20

QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks

Taylor Lundy, Narun K. Raman, Kevin Leyton-Brown

arXiv:2604.17842 | 61.9 | h-index: 5

AI Analysis

For LLM evaluators, QuickScope provides a more efficient way to identify model weaknesses in dynamic benchmarks, though it is an incremental improvement over existing Bayesian optimization methods.

This paper introduces QuickScope, a method for efficiently identifying hard questions in dynamic LLM benchmarks by adapting the COUP Bayesian optimization algorithm. Experiments show it discovers difficult questions more sample-efficiently than baselines while reducing false positives.

LLM benchmarks are increasingly dynamic: instead of containing a fixed set of questions, they define templates and parameters that can generate an effectively unlimited number of question variants. This flexibility is valuable, but it makes evaluation expensive -- especially when the goal is not just determining an average score, but reliably identifying a model's weak spots. This paper introduces a new methodology for identifying hard questions in dynamic benchmarks. It leverages COUP, a recent Bayesian optimization algorithm (Graham, Velez & Leyton-Brown, 2026), after introducing several substantive modifications to make the algorithm suitable for practical LLM pipelines. We also wrap it in a tool that supports flexible choices of datasets and utility functions, enabling users to target the kinds of questions they care about (e.g., low-accuracy questions; questions that are unusually hard relative to their measured complexity). In experiments across a range of benchmarks, we show that our method, dubbed $\texttt{QuickScope}$, discovers truly difficult questions more sample-efficiently than standard baselines, while also reducing false positives from noisy outcomes.
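To make the "false positives from noisy outcomes" concern concrete, here is a minimal Python sketch of certifying one question variant as hard from repeated noisy trials. It is not COUP and not the QuickScope implementation: it swaps in a plain one-sided Hoeffding bound, and every name in it (`low_accuracy_utility`, `hoeffding_radius`, `certify_hard`, the `threshold` and `delta` parameters) is a hypothetical chosen for illustration. The idea is only that a variant should be flagged when a lower confidence bound on its mean utility clears a threshold, not when a handful of samples happen to fail.

```python
import math


def low_accuracy_utility(correct: bool) -> float:
    """Hypothetical utility: 1.0 when the model answers wrongly, else 0.0."""
    return 0.0 if correct else 1.0


def hoeffding_radius(n: int, delta: float) -> float:
    """One-sided Hoeffding radius for the mean of n samples bounded in [0, 1]."""
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def certify_hard(outcomes, threshold=0.5, delta=0.05):
    """Return True if the variant's mean utility exceeds `threshold`
    with confidence at least 1 - delta (assumption: i.i.d. trials).

    `outcomes` is a list of booleans, True meaning the model was correct.
    """
    n = len(outcomes)
    if n == 0:
        return False  # no evidence, so nothing can be certified
    mean_utility = sum(low_accuracy_utility(c) for c in outcomes) / n
    lower_bound = mean_utility - hoeffding_radius(n, delta)
    return lower_bound > threshold


# A single failure is not enough evidence to certify the variant as hard.
one_failure = certify_hard([False])

# Fifty consecutive failures push the lower bound well above the threshold.
many_failures = certify_hard([False] * 50)

# 30 failures in 50 trials looks hard on average (0.6) but is not certified.
mixed = certify_hard([False] * 30 + [True] * 20)
```

A different utility, such as one that rewards failures on low-complexity questions, could be substituted for `low_accuracy_utility` as long as it stays bounded in [0, 1], since the Hoeffding radius relies on that range.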

