CL, Oct 29, 2024

FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation

Farima Fatahi Bayat, Lechen Zhang, Sheza Munir, Lu Wang

arXiv:2410.22257v213.226 citations, h-index: 5, ACL

Originality Incremental advance

AI Analysis

This paper addresses the problem of assessing language model factuality, a concern for researchers and developers. The contribution is incremental: it builds on existing evaluation methods with a new dataset and pipeline.

The authors evaluate language model factuality in real-world interactions with VERIFY, a pipeline that categorizes LM-generated content as supported, unsupported, or undecidable based on web-retrieved evidence; its judgments correlate better with human evaluations than those of existing methods. They use VERIFY to build FACTBENCH, a dataset of 1K prompts across 150 topics, and benchmark models from the GPT, Gemini, and Llama families. Key findings: proprietary models show better factuality, Llama3.1-405B-Instruct has comparable or lower factual precision than Llama3.1-70B-Instruct, and Gemini1.5-Pro over-refuses in 25% of cases.

The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We first present VERIFY (Verification and Evidence RetrIeval for FactualitY evaluation), a pipeline to evaluate LMs' factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as supported, unsupported, or undecidable based on Web-retrieved evidence. Importantly, factuality judgment by VERIFY correlates better with human evaluations than existing methods. Using VERIFY, we identify "hallucination prompts" across diverse topics, i.e., those eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses. These prompts form FACTBENCH, a dataset of 1K prompts across 150 fine-grained topics. Our dataset captures emerging factuality challenges in real-world LM interactions and can be regularly updated with new prompts. We benchmark widely-used LMs from GPT, Gemini, and Llama families on FACTBENCH, yielding the following key findings: (i) Proprietary models exhibit better factuality, with decreased performance from Easy to Hard hallucination prompts. (ii) Llama3.1-405B-Instruct shows comparable or lower factual precision than Llama3.1-70B-Instruct across all evaluation methods due to its higher subjectivity that leads to more content labeled as undecidable. (iii) Gemini1.5-Pro shows a significantly higher refusal rate, with over-refusal in 25% of cases.
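The selection of "hallucination prompts" described in the abstract can be illustrated with a minimal sketch. It assumes each response has already been split into content units and labeled supported, unsupported, or undecidable. The function names (`label_rates`, `hallucination_rate`, `rank_prompts`) and the plain unweighted ratio are assumptions made for illustration; the abstract does not give the scoring formula VERIFY actually uses.

```python
from collections import Counter

# The three labels named in the abstract.
LABELS = ("supported", "unsupported", "undecidable")


def label_rates(unit_labels):
    """Return the fraction of content units carrying each label."""
    counts = Counter(unit_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(unit_labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


def hallucination_rate(unit_labels):
    """Assumed score: share of units that are unsupported or undecidable.

    This is a simple unweighted ratio chosen for illustration, not the
    paper's metric.
    """
    rates = label_rates(unit_labels)
    return rates["unsupported"] + rates["undecidable"]


def rank_prompts(labels_by_prompt, top_k):
    """Return the top_k prompts with the highest assumed hallucination rate."""
    ranked = sorted(
        labels_by_prompt,
        key=lambda prompt: hallucination_rate(labels_by_prompt[prompt]),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical labeled units for three made-up prompts.
    example = {
        "prompt_a": ["supported", "supported", "supported", "unsupported"],
        "prompt_b": ["unsupported", "undecidable", "supported", "undecidable"],
        "prompt_c": ["supported", "supported"],
    }
    print(rank_prompts(example, top_k=2))
```

Under this assumed score, a prompt whose responses contain mostly unsupported or undecidable units ranks higher, which mirrors the abstract's description of prompts "eliciting the highest rates of incorrect (unsupported) and inconclusive (undecidable) LM responses".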
