CL Nov 14, 2024

DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form Text through a Benchmark Dataset in Biomedicine

arXiv:2411.09255v1, 6 citations, h-index: 3
AI Analysis

This paper provides a domain-specific automated evaluation system for hallucination in biomedical text generation. The contribution is incremental, but it addresses a known bottleneck in specialized domains.

The authors address the evaluation of hallucinations in long-form biomedical text generation by creating DAHL, a benchmark dataset of 8,573 questions across 29 categories. Across 8 models, they find that larger models hallucinate less, but beyond roughly 7 to 8 billion parameters further scaling does not significantly improve factual accuracy.

We introduce DAHL, a benchmark dataset and automated evaluation system designed to assess hallucination in long-form text generation, specifically within the biomedical domain. Our benchmark dataset, meticulously curated from biomedical research papers, consists of 8,573 questions across 29 categories. DAHL evaluates fact-conflicting hallucinations in Large Language Models (LLMs) by deconstructing responses into atomic units, each representing a single piece of information. The accuracy of these responses is averaged to produce the DAHL Score, offering a more in-depth evaluation of hallucinations compared to previous methods that rely on multiple-choice tasks. We conduct experiments with 8 different models, finding that larger models tend to hallucinate less; however, beyond a model size of 7 to 8 billion parameters, further scaling does not significantly improve factual accuracy. The DAHL Score holds potential as an efficient alternative to human-annotated preference labels and can be extended to other specialized domains. We publicly release the dataset and code.
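The scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released code. It assumes each response has already been split into atomic units and that each unit already carries a true/false factuality verdict; how DAHL performs the splitting and checking is not specified here. The function names `response_accuracy` and `dahl_score` are hypothetical, and skipping responses with no atomic units is an assumption of this sketch.

```python
from statistics import mean
from typing import Sequence


def response_accuracy(verdicts: Sequence[bool]) -> float:
    """Fraction of atomic units in one response judged factually correct."""
    if not verdicts:
        raise ValueError("response has no atomic units")
    return sum(verdicts) / len(verdicts)


def dahl_score(all_verdicts: Sequence[Sequence[bool]]) -> float:
    """Average the per-response accuracies over all responses.

    Assumption of this sketch: responses with zero atomic units are
    skipped rather than counted as 0 or 1.
    """
    accuracies = [response_accuracy(v) for v in all_verdicts if v]
    if not accuracies:
        raise ValueError("no scorable responses")
    return mean(accuracies)


# Illustrative verdicts for three responses (made-up data, not from the paper).
example = [
    [True, True, False, True],  # 3 of 4 units correct
    [True, False],              # 1 of 2 units correct
    [True, True, True],         # 3 of 3 units correct
]
print(dahl_score(example))
```

Averaging per response, rather than pooling all atomic units, gives each response equal weight regardless of its length; this follows the abstract's wording that the accuracy of the responses is averaged.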

Code Implementations: 1 repo
