CL · Nov 14, 2024

DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine

Jean Seo, Jongwon Lim, Dongjun Jang, Hyopil Shin

arXiv:2411.09255v14.86 citations · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

DAHL provides a domain-specific automated evaluation system for hallucination in biomedical text generation. The advance is incremental, but it addresses a known bottleneck in specialized domains.

The authors address the evaluation of hallucination in long-form biomedical text generation by building DAHL, a benchmark dataset of 8,573 questions across 29 categories. They find that larger models hallucinate less up to roughly 7 to 8 billion parameters, and that scaling beyond that size does not significantly improve factual accuracy.

We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations than previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models and find that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels and can be extended to other specialized domains. We release the dataset and code publicly.
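The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes each response has already been split into atomic units and that each unit already carries a factual/non-factual label; the abstract does not say how the splitting or checking is done, so both steps are left out here. The function names `response_accuracy` and `dahl_style_score` are hypothetical and are not taken from the released code.

```python
from statistics import mean


def response_accuracy(unit_labels):
    """Fraction of atomic units in one response that were judged factual.

    unit_labels: list of booleans, one per atomic unit (assumed to be given).
    """
    if not unit_labels:
        # How empty responses are treated is an assumption of this sketch.
        raise ValueError("a response must contain at least one atomic unit")
    return sum(unit_labels) / len(unit_labels)


def dahl_style_score(per_response_labels):
    """Average the per-response accuracies over all responses."""
    return mean(response_accuracy(labels) for labels in per_response_labels)


# Two hypothetical responses: 3 of 4 units factual, then 1 of 2 units factual.
labels = [[True, True, False, True], [True, False]]
score = dahl_style_score(labels)  # (0.75 + 0.5) / 2 = 0.625
```

Averaging per response, rather than pooling all units, gives each response equal weight regardless of how many atomic units it contains; whether the released implementation weights responses this way is not stated in the abstract.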
