CL · Aug 4, 2025

Test Set Quality in Multilingual LLM Evaluation

arXiv:2508.02635v2 · 2 citations · h-index: 20
Originality: Synthesis-oriented
AI Analysis

This work addresses dataset quality issues for researchers and practitioners in multilingual LLM evaluation, though it is incremental as it builds on prior error identification efforts.

The paper tackles the problem of low-quality multilingual benchmark datasets by manually analyzing errors in French and Telugu test sets. Scores on the corrected versions differed from scores on the originals by almost 10% in some cases across LLMs.

Several multilingual benchmark datasets have been developed in a semi-automatic manner in the recent past to measure progress and understand the state of the art in the multilingual capabilities of Large Language Models. However, little attention has been paid to the quality of the datasets themselves, despite previous work identifying errors even in fully human-annotated test sets. In this paper, we manually analyze recent multilingual evaluation sets in two languages, French and Telugu, identifying several errors in the process. We compare the performance of several LLMs on the original and revised versions of the datasets and identify large differences (almost 10% in some cases) in both languages. Based on these results, we argue that test sets should not be considered immutable and should be revisited, checked for correctness, and potentially versioned. We end with some recommendations for both dataset creators and consumers on addressing dataset quality issues.
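The comparison described in the abstract, scoring the same model outputs against the original and the revised gold labels, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names `accuracy` and `accuracy_shift`, the exact-match metric, and the toy labels are all assumptions made for illustration, and the sketch assumes the revised set keeps the same items in the same order as the original (a real revision may also drop or rewrite items).

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def accuracy_shift(
    predictions: Sequence[str],
    original_gold: Sequence[str],
    revised_gold: Sequence[str],
) -> float:
    """Accuracy on the revised labels minus accuracy on the original
    labels, in percentage points. Positive means the model scores
    higher once the test set errors are corrected."""
    original = accuracy(predictions, original_gold)
    revised = accuracy(predictions, revised_gold)
    return 100.0 * (revised - original)


# Toy, made-up data (hypothetical, not from the paper): item 2 had a
# wrong gold label in the original set and was fixed in the revision.
predictions = ["A", "C", "B", "D", "A"]
original_gold = ["A", "B", "B", "D", "C"]
revised_gold = ["A", "C", "B", "D", "C"]

shift = accuracy_shift(predictions, original_gold, revised_gold)
print(f"Accuracy shift after revision: {shift:.1f} percentage points")
```

Recording which version of the gold labels a score was computed against, alongside the score itself, is one simple way to act on the abstract's suggestion that test sets be versioned.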

