Efficient Benchmarking Is Just Feature Selection and Multiple Regression

arXiv:2605.25773 · 96.6 · Has Code

Predicted impact: top 1% in ML · last 90 days · Originality: Incremental advance

AI Analysis

This work provides a simpler, more effective approach to reducing the computational cost of LLM evaluation, benefiting researchers and practitioners who need to benchmark models efficiently.

The authors reframe efficient LLM benchmarking as multiple regression with feature selection, showing that kernel ridge regression and mRMR feature selection consistently outperform existing methods in prediction error and ranking correlation across benchmarks, while being faster and more stable.
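To make the prediction stage concrete, here is a minimal sketch of kernel ridge regression mapping a model's answers on a small question subset to its full benchmark score. It is not the authors' implementation: the RBF kernel, the regularisation strength `lam`, the default `gamma`, the helper names (`rbf_kernel`, `fit_krr`, `predict_krr`), the synthetic response matrix, and the random choice of subset are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    # Squared Euclidean distance between every row of A and every row of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_krr(X, y, lam=1.0, gamma=None):
    # Kernel ridge regression: solve (K + lam * I) alpha = y - mean(y).
    if gamma is None:
        gamma = 1.0 / X.shape[1]  # assumed default, not taken from the paper
    y_mean = y.mean()
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y - y_mean)
    return {"X": X, "alpha": alpha, "gamma": gamma, "y_mean": y_mean}

def predict_krr(model, X_new):
    K_new = rbf_kernel(X_new, model["X"], model["gamma"])
    return K_new @ model["alpha"] + model["y_mean"]

# Synthetic stand-in for a benchmark: rows are LLMs, columns are questions,
# entries are 1.0 (correct) or 0.0 (incorrect).
rng = np.random.default_rng(0)
n_models, n_questions, k = 200, 300, 30
ability = rng.normal(size=n_models)
difficulty = rng.normal(size=n_questions)
p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_models, n_questions)) < p_correct).astype(float)
full_scores = responses.mean(axis=1)  # the quantity to predict

# Question subset chosen at random here; a selection method would go in its place.
subset = rng.choice(n_questions, size=k, replace=False)
train, test = np.arange(150), np.arange(150, 200)

model = fit_krr(responses[train][:, subset], full_scores[train], lam=1.0)
pred = predict_krr(model, responses[test][:, subset])

mae = np.abs(pred - full_scores[test]).mean()
# Baseline: predict the mean training score for every held-out model.
baseline_mae = np.abs(full_scores[train].mean() - full_scores[test]).mean()
print(f"KRR MAE: {mae:.4f}  mean-baseline MAE: {baseline_mae:.4f}")
```

Each held-out model is evaluated on only `k` questions, and the regressor trained on previously evaluated models fills in the full score.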

Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchmarking methods can be greatly improved by simply using kernel ridge regression at the prediction stage. Additionally, using an information-theoretic feature-selection algorithm called minimum redundancy maximum relevance (mRMR), we can further improve upon these methods by selecting question subsets that will be maximally useful for prediction. Except in very data-poor settings, these approaches consistently achieve smaller prediction errors (in both MAE and RMSE), and greater ranking correlation between predicted and true scores (in both Spearman $ρ$ and Kendall $τ$) across a range of benchmarks using both binary and continuous metrics. Furthermore, mRMR subsampling is much faster than competitor methods (which often involve fitting probabilistic models or running clustering algorithms), and is more likely to select the same questions under different random seeds or training data splits. Tutorial code can be found at https://github.com/sambowyer/mrmr_eval .
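The selection step described above can be sketched as a greedy loop that repeatedly adds the question with the best relevance-minus-redundancy score. The abstract describes mRMR as information-theoretic; this sketch swaps in absolute Pearson correlation as a simple stand-in for mutual information, so it is an mRMR-style illustration rather than the authors' algorithm. The function name `mrmr_select` and the toy data are hypothetical.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedily pick k columns of X that are relevant to y and not redundant.

    Relevance and redundancy are both measured with absolute Pearson
    correlation (an assumed proxy for mutual information).
    """
    n = len(y)
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0  # constant columns end up with zero correlation
    Xz = Xc / std
    yz = (y - y.mean()) / y.std()  # assumes y is not constant

    relevance = np.abs(Xz.T @ yz) / n
    selected = [int(np.argmax(relevance))]  # start with the most relevant column
    redundancy_sum = np.zeros(X.shape[1])

    while len(selected) < k:
        # Add each candidate's correlation with the most recently chosen column.
        last = selected[-1]
        redundancy_sum += np.abs(Xz.T @ Xz[:, last]) / n
        # Score = relevance minus mean redundancy with the chosen set.
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never pick the same column twice
        selected.append(int(np.argmax(score)))
    return selected

# Toy check: columns 0 and 1 are near-duplicates, column 2 carries new signal,
# column 3 is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X = np.column_stack([
    a,
    a + 1e-3 * rng.normal(size=500),
    b,
    rng.normal(size=500),
])
y = 2.0 * a + b

selected = mrmr_select(X, y, k=2)
print(selected)
```

In the toy data the second pick skips the near-duplicate column even though it is more correlated with the target than the column carrying `b`, which is the redundancy penalty doing its job. In the benchmarking setting, the columns of `X` would be per-question results across previously evaluated models and `y` their full benchmark scores.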
