CL AI | Apr 11, 2025

MedHal: An Evaluation Dataset for Medical Hallucination Detection

Gaya Mehenni, Fabrice Lamarche, Odette Rios-Ibacache, John Kildea, Amal Zouaq

arXiv:2504.08596v2 | 2 citations | h-index: 2

Originality: Synthesis-oriented

AI Analysis

This work addresses the critical need for reliable hallucination detection in medical AI to prevent disastrous consequences, though it is incremental as it builds on existing datasets and methods.

The paper tackles the problem of detecting hallucinations in medical texts by introducing MedHal, a large-scale dataset that addresses limitations of existing datasets, and demonstrates its utility with a baseline model showing improvements over general-purpose approaches.

We present MedHal, a novel large-scale dataset specifically designed to evaluate whether models can detect hallucinations in medical texts. Current hallucination detection methods face significant limitations when applied to specialized domains like medicine, where hallucinations can have disastrous consequences. Existing medical datasets are either too small, containing only a few hundred samples, or focus on a single task like Question Answering or Natural Language Inference. MedHal addresses these gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning. We demonstrate MedHal's utility by training and evaluating a baseline medical hallucination detection model, showing improvements over general-purpose hallucination detection approaches. This resource enables more efficient evaluation of medical text generation systems while reducing reliance on costly expert review, potentially accelerating medical AI research.
