LLM Benchmark Datasets Should Be Contamination-Resistant

Ali Al-Lawati, Jason Lucas, Dongwon Lee, Suhang Wang

arXiv:2605.19999

AI Analysis

The paper addresses benchmark contamination, a problem for the LLM evaluation community, and proposes a new paradigm for dataset design.

The paper argues that LLM benchmark datasets should be contamination-resistant (unlearnable but usable for inference) to ensure reliable evaluation, and outlines properties and methods to achieve this, calling for community adoption.

Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., $\textit{contaminated}$, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be $\textit{contamination-resistant}$, i.e., $\textit{unlearnable}$, but support $\textit{inference}$. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM architectures. Based on the above, we call on the community to ensure the reliability of LLM benchmarking by: (i) advancing novel contamination-resistant methodologies, (ii) developing supporting methods and platforms, and (iii) adopting contamination-resistant benchmarks into existing evaluation pipelines.
