Learning to Assess the Reliability of Number-of-Runs Estimation in Stochastic Optimization
For researchers benchmarking stochastic optimization algorithms, this work provides a learning-based method to detect unreliable run-number estimates, though it is limited to within-configuration scenarios and is an incremental extension of an existing online heuristic.
The paper addresses the problem of determining when sufficient runs have been collected in stochastic optimization benchmarking to obtain reliable performance estimates. Using 132,000 Nevergrad runs, the authors train classifiers on 23 features to predict estimate reliability, achieving high minority-class recall in a within-configuration learning setup.
In large-scale benchmarking of stochastic optimization algorithms, the key challenge is no longer whether repeated runs are needed for reliability, but how to determine when sufficient evidence has been collected without incurring unnecessary computational cost. We study a learning-based extension of a recent empirical online heuristic that adaptively estimates the required number of runs using outlier handling and skewness-based symmetry checks. Using annotated outcomes from 132,000 Nevergrad runs on COCO (24 problems in 20 dimensions, 10 instances each, 11 optimizers), we train classifiers on 23 statistical, energy-free, and shape and stability features to predict whether a run-number estimate is reliable, prioritizing detection of incorrect estimates via minority-class recall. We evaluate reliability prediction using a within-configuration learning setup, where models are trained and tested on data sharing the same optimizer. The results show that run-number reliability can be learned in a within-configuration scenario, enabling detection of unreliable estimates with high minority-class recall, although performance remains limited by the restricted data diversity within fixed configurations.
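To make the evaluation criterion concrete, the following is a minimal sketch of how minority-class recall could be computed separately for each optimizer, mirroring the within-configuration setup described above. It is not the authors' code: the function names, the label encoding (1 for a reliable estimate, 0 for an unreliable one), and the toy records are all assumptions made for illustration, and the sketch takes the minority class to be the least frequent true label in each group.

```python
from collections import Counter, defaultdict


def minority_class_recall(y_true, y_pred):
    """Recall on the least frequent label in y_true (hypothetical helper).

    If two labels are equally frequent, the one seen first is used.
    """
    counts = Counter(y_true)
    minority = min(counts, key=counts.get)
    # Among samples whose true label is the minority class,
    # count how many the model also predicted as the minority class.
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == minority and p == minority)
    return hits / counts[minority]


def recall_per_optimizer(records):
    """Group (optimizer, y_true, y_pred) triples by optimizer and score each group."""
    groups = defaultdict(lambda: ([], []))
    for optimizer, t, p in records:
        groups[optimizer][0].append(t)
        groups[optimizer][1].append(p)
    return {opt: minority_class_recall(t, p) for opt, (t, p) in groups.items()}


# Toy, made-up labels (assumption): 1 = reliable estimate, 0 = unreliable estimate.
records = [
    ("opt_a", 1, 1), ("opt_a", 1, 1), ("opt_a", 1, 0), ("opt_a", 0, 0), ("opt_a", 0, 1),
    ("opt_b", 1, 1), ("opt_b", 1, 1), ("opt_b", 0, 0),
]
scores = recall_per_optimizer(records)
print(scores)
```

Scoring each optimizer separately keeps a model that does well on one optimizer from hiding missed unreliable estimates on another, which is the quantity the abstract says is prioritized.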