CRAI Feb 24, 2025

Detecting Benchmark Contamination Through Watermarking

arXiv:2502.17259v2 | 6 citations | h-index: 4
Originality: Incremental advance
AI Analysis

The paper addresses the reliability of LLM evaluations, offering researchers and practitioners a practical way to detect benchmark contamination. The advance is incremental, since it builds on existing watermarking and statistical methods.

The paper tackles benchmark contamination in LLM evaluations by watermarking benchmarks before release, so that contamination can later be detected with a statistical test. Results show that benchmark utility is maintained and that contamination is detected when it is large enough to improve performance, e.g., a p-value of 10^-3 for a +5% gain on ARC-Easy.

Benchmark contamination poses a significant challenge to the reliability of Large Language Model (LLM) evaluations, as it is difficult to assert whether a model has been trained on a test set. We introduce a solution to this problem by watermarking benchmarks before their release. The embedding involves reformulating the original questions with a watermarked LLM, in a way that does not alter the benchmark utility. During evaluation, we can detect "radioactivity", i.e., traces that the text watermarks leave in the model during training, using a theoretically grounded statistical test. We test our method by pre-training 1B models from scratch on 10B tokens with controlled benchmark contamination, and validate its effectiveness in detecting contamination on ARC-Easy, ARC-Challenge, and MMLU. Results show similar benchmark utility post-watermarking and successful contamination detection when models are contaminated enough to enhance performance, e.g., p-value = 10^-3 for +5% on ARC-Easy.
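
The abstract does not spell out the statistical test, so the following is only a minimal sketch of how such a detection p-value could be computed. It assumes a green-list/red-list style watermark in which, for a model never exposed to the watermarked benchmark, each scored token falls in the "green" list with probability `gamma`. The function name `binomial_tail_pvalue`, the parameter `gamma`, and the example counts are all hypothetical and are not taken from the paper.

```python
import math


def binomial_tail_pvalue(k: int, n: int, gamma: float) -> float:
    """One-sided p-value P(X >= k) for X ~ Binomial(n, gamma).

    Assumption (not from the source): under the null hypothesis of no
    contamination, each of the n scored tokens is "green" independently
    with probability gamma, and k is the observed number of green tokens.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")

    # Log of each binomial term, computed with lgamma to avoid overflow.
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(gamma) + (n - i) * math.log1p(-gamma)
        for i in range(k, n + 1)
    ]
    # Log-sum-exp: factor out the largest term before exponentiating.
    m = max(log_terms)
    return min(1.0, math.exp(m) * sum(math.exp(t - m) for t in log_terms))


if __name__ == "__main__":
    # Illustrative counts only; these are not results from the paper.
    n_scored, n_green = 1000, 550
    p_value = binomial_tail_pvalue(n_green, n_scored, gamma=0.5)
    print(f"p-value = {p_value:.3g}")
```

A small p-value means the suspect model favors green tokens more often than chance would explain, which is the kind of trace the abstract calls radioactivity. A real test would also need to handle dependence between tokens (for example by de-duplicating repeated contexts), which this sketch ignores.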
