CL · AI · Feb 20, 2024

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

arXiv:2402.13249v2 · 86 citations · h-index: 16 · NAACL
AI Analysis

Aimed at NLP researchers, this work examines factual consistency in dialogue summarization, a domain where current LLM advances fail to generalize. Its contribution is incremental: it extends existing evaluation methods to a new domain.

The authors study factual hallucinations in topic-focused dialogue summaries generated by LLMs. They find that existing models produce significant factual errors regardless of size, and that specialized non-LLM metrics outperform LLM-based evaluators at detecting these errors.

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.
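The evaluation setup described in the abstract (binary sentence-level labels, and evaluators judged as binary classifiers against them) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the rule that a summary counts as consistent only if all its sentences are, the toy labels, and the choice of balanced accuracy as the score. The abstract does not state which aggregation rule or metric the authors use.

```python
from typing import Sequence


def summary_label(sentence_labels: Sequence[int]) -> int:
    """Aggregate sentence-level labels (1 = consistent, 0 = inconsistent).

    Assumed rule: a summary is consistent only if every sentence is.
    """
    return int(all(sentence_labels))


def balanced_accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Mean of recall on the consistent class and recall on the inconsistent class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pos = sum(1 for g in gold if g == 1)
    neg = len(gold) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must appear in gold")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Toy data, not from the benchmark: human labels vs. a hypothetical evaluator.
gold_labels = [1, 1, 0, 0, 1]
evaluator_labels = [1, 0, 0, 1, 1]
score = balanced_accuracy(gold_labels, evaluator_labels)
```

Balanced accuracy is used here because it is not inflated when one class dominates, which is a common situation when most summary sentences are consistent.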

Code Implementations: 1 repo
