CL · May 22, 2025

The Case for Repeatable, Open, and Expert-Grounded Hallucination Benchmarks in Large Language Models

Justin D. Norman, Michael U. Rivera, D. Alex Hughes

arXiv:2505.17345v2 · 1 citation · h-index: 2

Originality: Synthesis-oriented

AI Analysis

The paper addresses the problem of unreliable hallucination measurement for researchers and practitioners, but its contribution is incremental: it builds on existing concerns without introducing a new method.

The paper argues that current hallucination benchmarks for large language models lack validity and practical utility when experts are not involved in data creation, proposing a taxonomy and case study to advocate for repeatable, open, and expert-grounded benchmarks.

Plausible, but inaccurate, tokens in model-generated text are widely believed to be pervasive and problematic for the responsible adoption of language models. Despite this concern, there is little scientific work that attempts to measure the prevalence of language model hallucination in a comprehensive way. In this paper, we argue that language models should be evaluated using repeatable, open, and domain-contextualized hallucination benchmarking. We present a taxonomy of hallucinations alongside a case study that demonstrates that when experts are absent from the early stages of data creation, the resulting hallucination metrics lack validity and practical utility.
