Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
For practitioners needing to select the best LLM from a large pool, this work provides a statistically valid method to reduce evaluation cost without sacrificing accuracy.
The paper addresses the high cost of evaluating many LLMs on a benchmark by combining multi-armed bandit algorithms with low-rank matrix factorization to predict scores, and it introduces doubly robust estimators that yield valid confidence intervals. Empirically, the method reduces the number of required evaluations while still accurately identifying the best model.
Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
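To make the role of the predictions concrete, here is a minimal sketch of a doubly robust (difference-style) estimate of one model's mean score. It is not the paper's estimator: it assumes examples are sampled uniformly without replacement for a single fixed model, it ignores adaptive model selection, and it builds no confidence interval. The function names `low_rank_predict` and `doubly_robust_mean`, the mean-imputation step, and the truncated-SVD predictor are all hypothetical choices made for illustration.

```python
import numpy as np


def low_rank_predict(scores, observed, rank=2):
    """Predict every entry of the score matrix from the observed ones.

    Illustrative stand-in for a low-rank factorization: unobserved entries
    are imputed with the row mean of observed entries, then the matrix is
    approximated by a rank-`rank` truncated SVD.
    """
    filled = np.where(observed, scores, 0.0)
    counts = observed.sum(axis=1, keepdims=True)
    row_means = filled.sum(axis=1, keepdims=True) / np.maximum(counts, 1)
    imputed = np.where(observed, scores, row_means)
    u, s, vt = np.linalg.svd(imputed, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def doubly_robust_mean(scores_row, preds_row, sampled_idx):
    """Mean of the predictions plus the mean residual on sampled examples.

    Under uniform sampling that is independent of the predictions, the
    residual term removes the bias of the predictions in expectation;
    accurate predictions shrink the residuals and hence the variance.
    """
    sampled_idx = np.asarray(sampled_idx)
    residuals = scores_row[sampled_idx] - preds_row[sampled_idx]
    return preds_row.mean() + residuals.mean()


# Synthetic demo: 4 models, 200 examples, binary correctness scores.
rng = np.random.default_rng(0)
true_acc = np.array([0.55, 0.60, 0.70, 0.72])  # made-up accuracies
scores = (rng.random((4, 200)) < true_acc[:, None]).astype(float)

# Pretend roughly 30% of the matrix was already evaluated and fit predictions.
observed = rng.random(scores.shape) < 0.3
preds = low_rank_predict(scores, observed, rank=2)

# Fresh uniform sample of 40 examples (without replacement) for model 2.
sampled = rng.choice(200, size=40, replace=False)
estimate = doubly_robust_mean(scores[2], preds[2], sampled)
print(f"DR estimate: {estimate:.3f}, full-data mean: {scores[2].mean():.3f}")
```

The design point is that the estimate stays unbiased however poor the predictions are, as long as the sample is drawn independently of them; prediction quality only affects how much variance is removed. Handling adaptively chosen models and finite-sample intervals, as the abstract describes, requires additional machinery that this sketch does not attempt.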