LG ML | Nov 24, 2025

When +1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

arXiv:2511.19794v1 | 1 citation

Originality: Synthesis-oriented

AI Analysis

The work is relevant to ML researchers and practitioners who need reliable evaluation methods under tight compute budgets, but it is incremental in that it builds on existing statistical techniques.

The paper addresses small performance improvements in machine learning that are often reported without uncertainty estimates. It proposes a conservative evaluation protocol that combines paired multi-seed runs, BCa bootstrap confidence intervals, and sign-flip permutation tests. In synthetic scenarios, single runs and unpaired t-tests often suggest significance for 0.6-2.0 point gains, whereas the paired protocol, run with only three seeds, never declares significance.
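As an illustration of the BCa interval mentioned in this summary, the following is a minimal sketch that computes a bias-corrected and accelerated bootstrap interval for the mean of per-seed deltas. It is not the paper's implementation: the function name `bca_interval`, the choice of the mean as the statistic, the resample count, and the example deltas are all assumptions made for this sketch.

```python
import random
from statistics import NormalDist


def _mean(xs):
    return sum(xs) / len(xs)


def bca_interval(deltas, n_boot=5000, alpha=0.05, seed=0):
    """BCa bootstrap interval for the mean of paired per-seed deltas.

    Hypothetical helper for illustration; needs at least two deltas.
    """
    rng = random.Random(seed)
    n = len(deltas)
    theta_hat = _mean(deltas)
    nd = NormalDist()

    # Bootstrap distribution of the mean delta (resample seeds with replacement).
    boots = sorted(_mean(rng.choices(deltas, k=n)) for _ in range(n_boot))

    # Bias correction z0: share of bootstrap means below the observed mean,
    # clamped away from 0 and 1 so the inverse normal CDF stays finite.
    frac = sum(b < theta_hat for b in boots) / n_boot
    frac = min(max(frac, 1 / (n_boot + 1)), n_boot / (n_boot + 1))
    z0 = nd.inv_cdf(frac)

    # Acceleration a from leave-one-out (jackknife) means.
    jack = [_mean(deltas[:i] + deltas[i + 1:]) for i in range(n)]
    jbar = _mean(jack)
    num = sum((jbar - j) ** 3 for j in jack)
    den = 6 * sum((jbar - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted(q):
        # Map a nominal quantile level to its BCa-adjusted level.
        z = nd.inv_cdf(q)
        return nd.cdf(z0 + (z0 + z) / (1 - a * (z0 + z)))

    def pick(q):
        # Read the bootstrap mean at level q, clamping the index into range.
        return boots[min(n_boot - 1, max(0, int(q * n_boot)))]

    return pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))


# Hypothetical per-seed accuracy deltas (new method minus baseline), in points.
deltas = [0.8, 1.1, 0.6]
low, high = bca_interval(deltas)
print(low, high)
```

With only three deltas, a resampled mean can take at most ten distinct values, so any bootstrap interval of this kind is coarse and cannot extend beyond the smallest and largest observed delta.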

Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets.
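To make the sign-flip permutation test on per-seed deltas concrete, here is a minimal sketch of an exact two-sided version that enumerates every sign assignment. The function name `sign_flip_pvalue`, the use of the mean delta as the test statistic, and the example deltas are assumptions for illustration; the abstract does not say whether the authors enumerate exactly or sample sign flips.

```python
from itertools import product


def sign_flip_pvalue(deltas):
    """Exact two-sided sign-flip test on the mean of paired deltas.

    Under the null of no improvement, each per-seed delta is assumed
    equally likely to be positive or negative, so all 2**n sign patterns
    are enumerated. Hypothetical helper for illustration.
    """
    n = len(deltas)
    observed = abs(sum(deltas) / n)
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        stat = abs(sum(s * d for s, d in zip(signs, deltas)) / n)
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Hypothetical per-seed deltas: the new method wins on all three seeds.
deltas = [0.8, 1.1, 0.6]
print(sign_flip_pvalue(deltas))
```

Under this exact two-sided formulation, three seeds give only 2^3 = 8 sign patterns, and the observed pattern and its mirror image always count as extreme, so the smallest attainable p-value is 2/8 = 0.25. That is consistent with the reported finding that the paired protocol never declares significance with three seeds at conventional thresholds such as 0.05.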
