CR AI · Feb 24, 2025

Detecting Benchmark Contamination Through Watermarking

Tom Sander, Pierre Fernandez, Saeed Mahloujifar, Alain Durmus, Chuan Guo

arXiv:2502.17259v2 · 16.76 citations · h-index: 22

Originality: Incremental advance

AI Analysis

This work addresses the reliability of LLM evaluations for researchers and practitioners by offering a practical way to detect benchmark contamination. It is incremental, however, because it builds on existing watermarking and statistical-testing methods.

The paper tackles benchmark contamination in LLM evaluations by watermarking benchmarks before release, so that contamination can later be detected with a statistical test. Results show that benchmark utility is maintained and that contamination is detected when it is large enough to improve performance, for example a p-value of 10^-3 for a +5% gain on ARC-Easy.

Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark utility. During evaluation, we can detect "radioactivity", i.e., traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, e.g., p-value = 10^-3 for +5% on ARC-Easy.
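
The detection step in the abstract can be illustrated with a minimal sketch. It assumes a green-list style watermark, in which a secret key and the previous token select a "green" fraction gamma of the vocabulary, and a one-sided binomial test on how often a suspect model's predicted tokens fall in that green list. The hashing rule, the function names (is_green, binomial_p_value, contamination_p_value), the integer token ids, the deduplication of scored pairs, and gamma = 0.5 are all illustrative assumptions, not the authors' implementation.

```python
import hashlib
from math import comb


def is_green(prev_token: int, token: int, key: int, gamma: float = 0.5) -> bool:
    """Hypothetical green-list rule: hash (key, prev_token, token) to [0, 1)
    and call the token green if the value falls below gamma."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < gamma


def binomial_p_value(k: int, n: int, gamma: float) -> float:
    """One-sided tail P(X >= k) for X ~ Binomial(n, gamma)."""
    return sum(
        comb(n, i) * gamma**i * (1 - gamma) ** (n - i) for i in range(k, n + 1)
    )


def contamination_p_value(prev_tokens, predicted_tokens, key, gamma=0.5):
    """Score the suspect model's predicted tokens against the green list.

    Under the null hypothesis (model never saw the watermarked benchmark),
    each scored prediction is green with probability gamma. Each distinct
    (prev_token, predicted_token) pair is scored once (an assumption here),
    so that repeated pairs do not inflate the count.
    """
    seen = set()
    green = 0
    for prev, pred in zip(prev_tokens, predicted_tokens):
        if (prev, pred) in seen:
            continue
        seen.add((prev, pred))
        green += is_green(prev, pred, key, gamma)
    return binomial_p_value(green, len(seen), gamma)
```

A small p-value means the model predicts green tokens more often than chance would allow, which is evidence that watermarked text was seen during training. A p-value near 1 is consistent with no contamination, although it does not prove its absence.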
