CL · Oct 27, 2023

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre

arXiv:2310.18018v132.4365 citations · h-index: 10

Originality: Synthesis-oriented

AI Analysis

This paper addresses a critical issue in NLP evaluation that can mislead research and publication outcomes. It is a position paper that proposes solutions rather than presenting new empirical results.

The paper argues that data contamination in NLP benchmarks, where LLMs are trained on test data, leads to overestimated performance and harmful scientific conclusions, and calls for community efforts to measure and flag such contamination.

In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a target benchmark and associated task with respect to their non-contaminated counterparts. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination.
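To make the idea of an automatic contamination measure concrete, here is a minimal sketch of one common heuristic: checking how many word n-grams of a benchmark test example also appear in a training corpus. This is an illustration only, not the measure proposed by the authors. It assumes the training corpus is available for inspection, which is often not the case in practice. The function names (`word_ngrams`, `overlap_ratio`, `flag_contaminated`) and the default values of `n` and `threshold` are hypothetical choices made for this example.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def word_ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of lowercased, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(example: str, corpus_ngrams: Set[NGram], n: int) -> float:
    """Fraction of the example's n-grams that also occur in the corpus."""
    grams = word_ngrams(example, n)
    if not grams:
        # Example is shorter than n tokens, so there is nothing to compare.
        return 0.0
    return len(grams & corpus_ngrams) / len(grams)


def flag_contaminated(
    test_examples: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,              # assumed n-gram length, illustrative only
    threshold: float = 0.5,  # assumed cutoff, illustrative only
) -> List[bool]:
    """Flag each test example whose n-gram overlap with the corpus
    reaches `threshold`."""
    corpus_ngrams: Set[NGram] = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    return [
        overlap_ratio(example, corpus_ngrams, n) >= threshold
        for example in test_examples
    ]


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    tests = ["the quick brown fox jumps", "a completely different sentence here"]
    print(flag_contaminated(tests, corpus, n=3))
```

A high overlap only suggests that the example text was seen during training. Paraphrased or reformatted test data would evade this check, which is one reason the extent of the problem is hard to measure.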
