CL, LG · Feb 19, 2024

GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham

CMU

arXiv:2402.12566v3 · 8.218 citations · h-index: 62

Originality: Incremental advance

AI Analysis

The paper addresses factual inaccuracies in LLM outputs for high-stakes applications such as healthcare or finance. It is an incremental advance, since it builds on existing fact-checking methods.

The paper tackles factual errors in language model outputs that are supposed to be grounded in reference documents. It presents GenAudit, a tool that suggests edits to unsupported claims and shows evidence for supported ones. Human evaluations show that it can detect errors in the outputs of 8 different LLMs, and user studies show that it improves human performance at finding errors.

LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in 8 different LLM outputs when summarizing documents from diverse domains. User studies demonstrate that using GenAudit can substantially improve the performance of humans at finding errors in LLM-generated summaries. We release our tool (GenAudit) and fact-checking model for public use.
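The two behaviours described in the abstract, flagging unsupported claims for removal and surfacing evidence for supported ones, can be illustrated with a toy sketch. This is not the GenAudit model or its API: the paper trains models for these tasks, whereas the sketch below substitutes a simple token-overlap heuristic. The names `audit`, `apply_edits`, and `AuditResult`, and the `threshold` value, are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


def tokens(text: str) -> set:
    """Lowercased word tokens; a crude stand-in for a trained model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class AuditResult:
    claim: str
    supported: bool
    evidence: Optional[str]  # best-matching reference sentence, if supported


def audit(reference: List[str], claims: List[str],
          threshold: float = 0.6) -> List[AuditResult]:
    """Mark each claim as supported or not, attaching evidence when supported.

    Assumption: a claim counts as supported when at least `threshold` of its
    tokens appear in a single reference sentence. This heuristic is
    illustrative only and is not the method used in the paper.
    """
    results = []
    for claim in claims:
        claim_toks = tokens(claim)
        best_sentence, best_score = None, 0.0
        for sentence in reference:
            overlap = len(claim_toks & tokens(sentence))
            score = overlap / len(claim_toks) if claim_toks else 0.0
            if score > best_score:
                best_sentence, best_score = sentence, score
        supported = best_score >= threshold
        results.append(
            AuditResult(claim, supported, best_sentence if supported else None)
        )
    return results


def apply_edits(results: List[AuditResult]) -> str:
    """Suggested edit: drop every claim that was not found to be supported."""
    return " ".join(r.claim for r in results if r.supported)


# Hypothetical example data, invented for this sketch.
reference = [
    "The patient was prescribed 50 mg of atenolol daily.",
    "A follow-up visit is scheduled in two weeks.",
]
claims = [
    "The patient was prescribed atenolol daily.",
    "The patient reported severe headaches.",
]
results = audit(reference, claims)
revised = apply_edits(results)
```

Here the first claim is kept and paired with the first reference sentence as evidence, while the second is dropped from the revised summary. Token overlap cannot catch contradictions such as a changed dosage, which is one reason a trained fact-checking model is needed in practice.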
