CL LG, May 28

Resolution Diagnostics for Paired LLM Evaluation

arXiv:2605.30315

AI Analysis

For practitioners and researchers evaluating LLMs, this work highlights a critical flaw in current leaderboard comparisons and provides a diagnostic to avoid misleading conclusions.

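The paper finds that many pairwise rankings on public LLM leaderboards lack statistical resolution under paired evaluation: 11 of 40 Open LLM Leaderboard v1 comparisons and up to 6 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved. It introduces a resolution ratio diagnostic, q = N/N*, and shows that the common unpaired Cohen-h-plus-(1-rho) shortcut, which several off-the-shelf power calculators inherit, misses the correct required sample size by about a factor of two for close comparisons.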

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.
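
The resolution ratio q = N/N* described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's expression for N*, so the code below substitutes a textbook normal-approximation sample size for a two-sided paired (McNemar-type) test on binary correct/incorrect outcomes; this stand-in is an assumption, not the authors' formula. The function names, the example accuracies, and the use of a single correlation rho between the two models' per-item correctness are likewise hypothetical choices made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def paired_required_n(p_a, p_b, rho, alpha=0.05, power=0.8):
    """Approximate number of shared items N* for a two-sided paired test.

    Assumption: a standard normal-approximation formula for McNemar-type
    tests, used here as a stand-in for the paper's N*.
    p_a, p_b: per-item accuracies of the two models.
    rho: correlation between the two models' per-item correctness.
    """
    delta = p_a - p_b
    if delta == 0:
        raise ValueError("equal accuracies: no finite N* resolves the pair")

    # Joint cell probabilities implied by the marginals and rho.
    cov = rho * sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    p_both = p_a * p_b + cov          # both models correct
    p_only_a = p_a - p_both           # A correct, B wrong
    p_only_b = p_b - p_both           # B correct, A wrong
    p_neither = 1 - p_both - p_only_a - p_only_b
    if min(p_both, p_only_a, p_only_b, p_neither) < 0:
        raise ValueError("rho is not feasible for these accuracies")

    psi = p_only_a + p_only_b         # probability that the models disagree
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha * sqrt(psi) + z_beta * sqrt(psi - delta**2)) ** 2 / delta**2
    return ceil(n)


def resolution_ratio(n_items, p_a, p_b, rho, alpha=0.05, power=0.8):
    """q = N / N*: values below 1 mean the benchmark is too small for this pair."""
    return n_items / paired_required_n(p_a, p_b, rho, alpha, power)


if __name__ == "__main__":
    # Hypothetical pair: 72% vs 70% accuracy, correctness correlation 0.6.
    for n_items in (1000, 12000):
        q = resolution_ratio(n_items, 0.72, 0.70, rho=0.6)
        status = "resolved" if q >= 1 else "unresolved"
        print(f"N={n_items}: q={q:.2f} ({status})")
```

Under this stand-in, a pair counts as unresolved at the chosen (alpha, 1-beta) whenever q < 1, matching how the abstract uses the ratio. The sketch ignores subject-level clustering, multiplicity correction, and sequential testing, all of which the abstract reports as separate analyses.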
