CL, AI | May 1, 2025

HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

Deanna Emery, Michael Goitia, Freddie Vargus, Iulia Neagu

arXiv:2505.00506v1 | 3 citations | h-index: 3

Originality: Incremental advance

AI Analysis

The paper addresses the critical challenge of hallucination detection in real-world applications such as Retrieval Augmented Generation. The advance is incremental because it builds on existing benchmarks, improving their diversity and scope.

The paper tackled the problem of detecting hallucinated content in large language models by introducing the HalluMix Benchmark, a diverse, task-agnostic dataset, and found that Quotient Detections achieved the best performance with an accuracy of 0.82 and an F1 score of 0.84.

As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content (text that is not grounded in supporting evidence) has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems, both open and closed source, highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance disparities between short and long contexts, with critical implications for real-world Retrieval Augmented Generation (RAG) implementations. Quotient Detections achieves the best overall performance, with an accuracy of 0.82 and an F1 score of 0.84.
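
The accuracy and F1 figures quoted above are standard binary-classification metrics. As a rough illustration of how they are computed from a detector's outputs, here is a minimal sketch; it assumes labels are encoded as 1 = hallucinated and 0 = faithful, which is an assumption of this example and not a convention confirmed by the abstract. The function name accuracy_and_f1 is hypothetical.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels.

    Assumed encoding (not taken from the paper): 1 = hallucinated, 0 = faithful.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hallucination correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # faithful text wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # hallucination missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # faithful text correctly passed

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall; defined as 0 when both are 0.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


if __name__ == "__main__":
    # Toy labels for illustration only; these are not benchmark data.
    truth = [1, 1, 0, 0, 1]
    preds = [1, 0, 0, 1, 1]
    acc, f1 = accuracy_and_f1(truth, preds)
    print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

Accuracy counts every correct decision equally, while F1 combines precision and recall on the positive class only, so the two can diverge when the share of hallucinated examples is far from one half.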
