Benchmarks Saturate When The Model Gets Smarter Than The Judge

Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis

arXiv:2601.19532

Originality: Incremental advance

AI Analysis

This work addresses unreliable evaluation, a critical issue for AI researchers, though it is incremental because it builds on existing datasets and judges.

The authors address inaccurate benchmarks for Large Language Models by creating Omni-MATH-2, a manually revised version of the Omni-MATH dataset that reduces dataset-induced noise. They find that judge errors can mask genuine differences between models well before a benchmark saturates: expert annotations show that Omni-Judge is wrong in 96.4% of the cases where it disagrees with GPT-5 mini.

Benchmarks are important tools to track progress in the development of Large Language Models (LLMs), yet inaccuracies in datasets and evaluation methods consistently undermine their effectiveness. Here, we present Omni-MATH-2, a manually revised version of the Omni-MATH dataset comprising a clean, exact-answer subset ($n{=}4181$) and a tagged, non-standard subset ($n{=}247$). Each problem was audited to ensure LaTeX compilability, solvability and verifiability, which involved adding missing figures or information, labeling problems requiring a proof, estimation or image, and removing clutter. This process significantly reduces dataset-induced noise, thereby providing a more precise assessment of model performance. The annotated dataset also allows us to evaluate judge-induced noise by comparing GPT-5 mini with the original Omni-Judge, revealing substantial discrepancies between judges on both the clean and tagged problem subsets. Expert annotations reveal that Omni-Judge is wrong in $96.4\%$ of the judge disagreements, indicating its inability to differentiate between models' abilities, even well before saturation of the benchmark occurs. As problems become more challenging, we find that increasingly competent judges become essential in order to prevent judge errors from masking genuine differences between models. Finally, neither judge identifies the present failure modes for the subset of tagged problems, demonstrating that dataset quality and judge reliability are both critical to develop accurate benchmarks of model performance.
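The judge comparison described in the abstract can be illustrated with a minimal sketch. It assumes each graded answer carries a boolean verdict from two judges plus an expert label; the `Verdict` record, the `disagreement_stats` helper, and the toy data are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    """One graded answer (hypothetical record layout, not the paper's schema)."""
    judge_a: bool  # first judge's verdict: True means "answer marked correct"
    judge_b: bool  # second judge's verdict
    expert: bool   # expert annotation, treated here as ground truth


def disagreement_stats(verdicts: list[Verdict]) -> dict[str, Optional[float]]:
    """Return how often the judges disagree, and how often judge A is the
    one contradicting the expert when they do."""
    if not verdicts:
        return {"disagreement_rate": None, "judge_a_wrong_share": None}

    # Keep only the items on which the two judges gave different verdicts.
    disagreements = [v for v in verdicts if v.judge_a != v.judge_b]
    rate = len(disagreements) / len(verdicts)
    if not disagreements:
        return {"disagreement_rate": rate, "judge_a_wrong_share": None}

    # With binary verdicts, exactly one judge matches the expert per disagreement.
    a_wrong = sum(v.judge_a != v.expert for v in disagreements)
    return {
        "disagreement_rate": rate,
        "judge_a_wrong_share": a_wrong / len(disagreements),
    }


# Toy data for illustration only; these are not results from the paper.
toy = [
    Verdict(judge_a=True, judge_b=True, expert=True),
    Verdict(judge_a=True, judge_b=False, expert=False),
    Verdict(judge_a=False, judge_b=True, expert=True),
    Verdict(judge_a=True, judge_b=False, expert=True),
    Verdict(judge_a=False, judge_b=False, expert=False),
]
stats = disagreement_stats(toy)
print(stats)
```

Under these assumptions, the 96.4% figure in the abstract corresponds to `judge_a_wrong_share` with Omni-Judge in the role of judge A. The share is conditional on disagreement, so it says nothing about items where both judges agree, including those where both are wrong.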
