CL · AI · LG · Apr 25, 2025

Evaluating Evaluation Metrics -- The Mirage of Hallucination Detection

Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu

CMU

arXiv:2504.18114v2 · 12.012 citations · h-index: 15 · EMNLP

Originality: Incremental advance

AI Analysis

This work addresses the challenge of reliably measuring hallucinations in language models, which is crucial for their safe deployment, but it is incremental as it evaluates existing metrics rather than proposing new ones.

The paper evaluates hallucination detection metrics for language models. It finds that current metrics often misalign with human judgments and improve inconsistently as models scale, while LLM-based evaluation with GPT-4 performs best.

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.
