Auditing LLM Benchmarks with Item Response Theory
This work addresses mislabeled examples in LLM benchmarks, which distort model evaluations and are carried unchanged into downstream benchmarks, affecting the researchers and developers who rely on them.
This paper introduces an Item Response Theory-based indicator that flags likely mislabels in LLM benchmarks, reaching 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks. The same model fit shows that reward models specialize in stylistic preference rather than factual knowledge, and it singles out one frontier reward model that agrees with detected mislabels at 78% accuracy, compared with 38% for its peers.
LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
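To make the idea of an IRT-based mislabel indicator concrete, the following is a minimal sketch on synthetic data, not the paper's implementation. It assumes a two-parameter logistic model, logit P(r_ij = 1) = a_j * theta_i + d_j, where r_ij = 1 means model i agrees with the recorded label of item j, and it assumes that a negative fitted discrimination a_j (stronger models disagree with the label more often) is a reasonable suspicion score. The model form, the score, the gradient-ascent fit, and the names `fit_2pl`, `simulate`, and `precision_at_k` are all illustrative assumptions; only the count of 114 models is taken from the text above.

```python
import numpy as np


def fit_2pl(responses, n_steps=500, lr=0.5, l2=0.01):
    """Fit logit P(r_ij = 1) = a_j * theta_i + d_j by plain gradient ascent.

    responses: (n_models, n_items) array of 0/1, where 1 means the model's
    answer agrees with the benchmark's recorded label.
    Returns (theta, a, d): model abilities, item discriminations, item offsets.
    """
    r = np.asarray(responses, dtype=float)
    n_models, n_items = r.shape
    # Initialise ability from raw agreement rates; this also pins the sign
    # convention so that "higher theta" means "agrees with labels more often".
    theta = r.mean(axis=1)
    theta = (theta - theta.mean()) / (theta.std() + 1e-8)
    a = np.ones(n_items)
    d = np.zeros(n_items)
    for _ in range(n_steps):
        logits = np.clip(theta[:, None] * a[None, :] + d[None, :], -30, 30)
        err = r - 1.0 / (1.0 + np.exp(-logits))  # d(log-likelihood)/d(logit)
        grad_theta = (err * a[None, :]).mean(axis=1)
        grad_a = (err * theta[:, None]).mean(axis=0) - l2 * a
        grad_d = err.mean(axis=0) - l2 * d
        theta = theta + lr * grad_theta
        a = a + lr * grad_a
        d = d + lr * grad_d
        # Standardise abilities each step to fix the location and scale.
        theta = (theta - theta.mean()) / (theta.std() + 1e-8)
    return theta, a, d


def simulate(n_models=114, n_items=300, flip_rate=0.1, seed=0):
    """Synthetic responses with a known set of flipped (mislabeled) items."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=n_models)
    a = rng.uniform(0.5, 2.0, size=n_items)
    d = rng.normal(size=n_items)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] * a[None, :] + d[None, :])))
    correct = (rng.random((n_models, n_items)) < p).astype(int)
    flipped = np.zeros(n_items, dtype=bool)
    n_flip = int(round(flip_rate * n_items))
    flipped[rng.choice(n_items, size=n_flip, replace=False)] = True
    # Simplification for binary items: on a mislabeled item, agreeing with
    # the recorded label is the same as answering incorrectly.
    responses = np.where(flipped[None, :], 1 - correct, correct)
    return responses, flipped


def precision_at_k(scores, is_mislabeled, k):
    """Fraction of true mislabels among the k highest-scoring items."""
    top = np.argsort(scores)[::-1][:k]
    return float(np.asarray(is_mislabeled)[top].mean())


responses, flipped = simulate()
theta_hat, a_hat, d_hat = fit_2pl(responses)
suspicion = -a_hat  # assumed indicator: most negative discrimination first
print("precision@20 on synthetic data:", precision_at_k(suspicion, flipped, 20))
```

On real benchmark data the response matrix would come from scoring each model's answers against the released labels, and flagged items would still need manual review, since a negative discrimination can also reflect a genuinely ambiguous item rather than a wrong label.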