HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
The work addresses the reliability of LLMs in critical domains by providing a more realistic evaluation benchmark, though the contribution is incremental because it builds on existing datasets and methods.
The authors tackle the problem of evaluating hallucinations in large language models (LLMs) in real-world settings by introducing HaluEval-Wild, a benchmark built from adversarially filtered ShareGPT user queries, with reference answers generated by GPT-4 with RAG. The benchmark supports a fine-grained analysis of hallucination types across various LLMs.
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging (adversarially filtered by Alpaca) user queries from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the types of hallucinations LLMs exhibit, and synthesize the reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach toward understanding and improving LLM reliability in scenarios reflective of real-world interactions. Our benchmark is available at https://github.com/HaluEval-Wild/HaluEval-Wild.
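The two mechanical steps the abstract describes, adversarial filtering of queries and per-type hallucination rates, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `adversarial_filter`, `hallucination_rate_by_type`, `respond`, and `is_hallucinated` are hypothetical, and the judging function is assumed to be supplied by the caller (for example, a comparison against a reference answer).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def adversarial_filter(
    queries: Iterable[str],
    respond: Callable[[str], str],
    is_hallucinated: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the queries on which the filter model's response is judged
    hallucinated. `respond` stands in for the filter model and
    `is_hallucinated` for the judge; both are assumptions of this sketch."""
    return [q for q in queries if is_hallucinated(q, respond(q))]


def hallucination_rate_by_type(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Compute the fraction of hallucinated responses per query type.
    Each record is (query_type, hallucinated)."""
    totals: Dict[str, int] = defaultdict(int)
    flagged: Dict[str, int] = defaultdict(int)
    for query_type, hallucinated in records:
        totals[query_type] += 1
        flagged[query_type] += int(hallucinated)
    return {t: flagged[t] / totals[t] for t in totals}
```

Passing the model and the judge as callables keeps the sketch independent of any particular LLM API. The query-type labels are left to the caller because the abstract states only that there are five types, not what they are.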