Mar 17

Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

arXiv:2603.16406
AI Analysis

The paper highlights a critical problem for researchers and practitioners working with low/medium-resource languages and calls for improved evaluation methods to ensure reliable benchmarking.

The paper identifies severe flaws in existing Icelandic benchmarks for Large Language Models (LLMs) that rely on unverified synthetic or machine-translated data. Such data skews results and undermines the benchmarks' validity, and the paper's quantitative error analysis shows clear differences between these benchmarks and human-authored or human-translated ones.

This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, in low/medium-resource languages in particular. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against the use of such methods without verification in low/medium-resource settings, as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks and synthetic or machine-translated benchmarks.

