LG · AI · May 17

State-of-the-Art Claims Require State-of-the-Art Evidence

arXiv:2605.17273 · 47.8

Predicted impact: top 53% in LG · last 90 days · Originality: Incremental advance

AI Analysis

For AI/ML researchers and practitioners, the paper highlights the need for more honest and precise reporting of benchmark results to avoid overclaiming state-of-the-art performance.

The paper identifies a widespread claim-evidence gap in AI benchmarking, showing that in over half of top-model comparisons on public leaderboards, at least one assumed property of superiority (e.g., meaningful effect size, consistency, robustness) does not hold, with aggregate gains often driven by outlier datasets.

State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark evaluations, where models are ranked by aggregate scores across tasks. Public benchmarks or leaderboards are the most visible instance, but the same structure appears in paper tables throughout the literature. However, such minimal evidence often cannot support these strong claims. We identify a widespread claim-evidence gap in AI benchmarking. Claiming SOTA carries implicit assumptions beyond mean score superiority, suggesting that a model meaningfully outperforms alternatives across most tasks. However, a marginal improvement in the mean score merely indicates a top average rank rather than true superiority. Analyzing ten cross-domain benchmarks from public leaderboards, we found that in more than half of top-model comparisons, at least one commonly assumed property of superiority does not hold. These properties include meaningful effect size, consistency across tasks, or robustness to dataset removal. Instead, aggregate gains are frequently driven by outlier datasets. This fragility persists even in benchmarks with many tasks. We argue that claim language should reflect the strength of the underlying evidence. This requires no additional experiments, only honest reporting of what results actually show, enabling more precise and interpretable comparisons across models.
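The three properties named in the abstract (meaningful effect size, consistency across tasks, robustness to dataset removal) can be illustrated with a small check on per-task scores. The sketch below is not the paper's procedure, which the abstract does not specify. The function name `compare_top_two`, the `min_gain` threshold, and the example scores are all hypothetical and chosen only to show how one outlier task can produce a higher mean.

```python
from statistics import mean


def compare_top_two(scores_a, scores_b, min_gain=1.0):
    """Compare a claimed leader (A) with a runner-up (B) on per-task scores.

    scores_a, scores_b: dicts mapping task name -> score (higher is better).
    min_gain: hypothetical threshold, in score points, for calling the
              mean gain "meaningful". This is an assumption, not a value
              taken from the paper.
    """
    tasks = sorted(scores_a)
    if tasks != sorted(scores_b):
        raise ValueError("both models must be scored on the same tasks")

    # Per-task difference: positive means A beats B on that task.
    diffs = {t: scores_a[t] - scores_b[t] for t in tasks}
    mean_gain = mean(diffs.values())

    # Consistency: fraction of tasks on which A is strictly better.
    win_rate = sum(d > 0 for d in diffs.values()) / len(tasks)

    # Robustness: tasks whose removal erases or reverses the mean gain.
    flips = []
    for t in tasks:
        rest = [d for u, d in diffs.items() if u != t]
        if rest and mean(rest) * mean_gain <= 0:
            flips.append(t)

    return {
        "mean_gain": mean_gain,
        "meaningful": mean_gain >= min_gain,
        "win_rate": win_rate,
        "flips_when_removed": flips,
    }


# Made-up scores: A loses narrowly on three tasks and wins big on one.
model_a = {"t1": 70, "t2": 65, "t3": 80, "t4": 95}
model_b = {"t1": 71, "t2": 66, "t3": 81, "t4": 80}
report = compare_top_two(model_a, model_b)
print(report)
```

With these made-up numbers, model A has the higher mean score yet wins on only one of four tasks, and dropping that single task reverses the ranking. A claim of "highest mean score" would be supported here, while a claim of outperforming alternatives across most tasks would not.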

