CL · Oct 21, 2024

Contamination Report for Multilingual Benchmarks

Microsoft
arXiv:2410.16186v1 · 2 citations · h-index: 10
Originality: Synthesis-oriented
AI Analysis

This paper addresses unreliable multilingual evaluation, a benchmark-integrity problem that affects AI researchers and practitioners who rely on benchmark scores to compare models.

The study investigated contamination of multilingual benchmarks in large language models, finding that nearly all tested models showed signs of contamination across most benchmarks, which can inflate evaluation scores.

Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data. Contamination can lead to inflated scores on benchmarks, compromising evaluation results and making it difficult to determine the capabilities of models. In this work, we study the contamination of popular multilingual benchmarks in LLMs that support multiple languages. We use the Black Box test to determine whether $7$ frequently used multilingual benchmarks are contaminated in $7$ popular open and closed LLMs and find that almost all models show signs of being contaminated with almost all the benchmarks we test. Our findings can help the community determine the best set of benchmarks to use for multilingual evaluation.
