MELG Mar 10, 2023

Accounting for multiplicity in machine learning benchmark performance

arXiv:2303.07272v6, 3 citations, h-index: 11
Originality Synthesis-oriented
AI Analysis

The paper addresses a methodological issue for researchers and practitioners who rely on benchmark comparisons, though the contribution is incremental: it refines existing statistical approaches.

The paper tackles the problem of overestimating state-of-the-art (SOTA) performance in machine learning benchmarks by distinguishing between the expected best performance among classifiers and the expected performance of the best classifier, showing that current methods lead to inflated estimates.

State-of-the-art (SOTA) performance refers to the highest performance achieved by some model on a test sample, preferably under controlled conditions such as public data (reproducibility) or public challenges (independent sample). Thousands of classifiers are applied, and the highest performance becomes the new reference point for a particular problem. In effect, this set-up is an estimate of the expected best performance among all classifiers applied to a random sample, that is, a sample maximum estimate. In this paper, we argue that SOTA should instead be estimated by the expected performance of the best classifier, which can be done without knowing which classifier it is. Our contribution is the formal distinction between the two, and an investigation into the practical consequences of using the former to estimate the latter. This is done by presenting sample maximum estimator distributions for non-identical and dependent classifiers. We illustrate the impact on real-world examples from public challenges.
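The gap between the two quantities in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method. It assumes independent classifiers whose per-example correctness is Bernoulli with a known true accuracy, which is simpler than the non-identical, dependent setting the paper treats. The function name `simulate_sota_bias` and its parameters are hypothetical and chosen for illustration only.

```python
import random


def simulate_sota_bias(true_accs, n_test, n_reps=500, seed=0):
    """Monte Carlo comparison of two notions of SOTA.

    Assumption (not from the paper): classifiers are independent and each
    test example is classified correctly with probability equal to that
    classifier's true accuracy.

    Returns (expected_max, best_true):
      expected_max: average over replications of the highest observed
                    accuracy among all classifiers (sample maximum).
      best_true:    the true accuracy of the best classifier.
    """
    rng = random.Random(seed)
    best_true = max(true_accs)
    total_max = 0.0
    for _ in range(n_reps):
        # Observed accuracy of each classifier on one random test sample.
        observed = [
            sum(rng.random() < p for _ in range(n_test)) / n_test
            for p in true_accs
        ]
        total_max += max(observed)
    return total_max / n_reps, best_true


if __name__ == "__main__":
    # 20 equally good classifiers evaluated on 200 test examples.
    expected_max, best_true = simulate_sota_bias([0.8] * 20, n_test=200)
    print(f"expected sample maximum: {expected_max:.3f}")
    print(f"true accuracy of best classifier: {best_true:.3f}")
```

Under these assumptions the averaged sample maximum sits above the true accuracy of the best classifier, and the gap widens as more classifiers are evaluated or as the test sample shrinks.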

Code Implementations: 1 repo