CL · Mar 4, 2024

FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction

Alessandro Scirè, Karim Ghonim, Roberto Navigli

arXiv:2403.02270v319.135 citations · h-index: 13 · Has Code · ACL

Originality: Incremental advance

AI Analysis

This work addresses the need for more interpretable and efficient factuality evaluation in text summarization, particularly for long-form content, though it is an incremental improvement over existing metrics.

The authors tackled the problem of factual inconsistencies in automatically-generated summaries by proposing FENICE, a new metric based on natural language inference and claim extraction, which achieved state-of-the-art results on the AGGREFACT benchmark.

Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE.
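The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the released FENICE code: the claims are assumed to be already extracted, `toy_entailment` is a hypothetical token-overlap stand-in for a real NLI model, and the max-over-sentences alignment with mean aggregation is an assumption that may differ from what the paper does. All function names here are illustrative.

```python
import re
from typing import Callable, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def toy_entailment(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model (NOT a real entailment scorer).

    Returns the fraction of hypothesis tokens that also occur in the premise.
    """
    def tokens(t: str) -> set:
        return set(re.findall(r"\w+", t.lower()))

    hyp = tokens(hypothesis)
    if not hyp:
        return 0.0
    return len(hyp & tokens(premise)) / len(hyp)


def align_claims(
    source: str,
    claims: List[str],
    nli: Callable[[str, str], float] = toy_entailment,
) -> List[Tuple[str, str, float]]:
    """Align each claim with the source sentence that supports it best.

    Returns (claim, best_source_sentence, score) triples, which is what makes
    the final score inspectable claim by claim.
    """
    sentences = split_sentences(source)
    if not sentences:
        return [(claim, "", 0.0) for claim in claims]
    alignments = []
    for claim in claims:
        best = max(sentences, key=lambda s: nli(s, claim))
        alignments.append((claim, best, nli(best, claim)))
    return alignments


def factuality_score(
    source: str,
    claims: List[str],
    nli: Callable[[str, str], float] = toy_entailment,
) -> float:
    """Assumed aggregation: mean of the per-claim alignment scores."""
    alignments = align_claims(source, claims, nli)
    if not alignments:
        return 0.0
    return sum(score for _, _, score in alignments) / len(alignments)


if __name__ == "__main__":
    source = "The cat sat on the mat. The dog barked loudly."
    claims = ["The cat sat on the mat", "The dog slept"]
    for claim, sentence, score in align_claims(source, claims):
        print(f"{claim!r} -> {sentence!r} ({score:.2f})")
    print(f"summary score: {factuality_score(source, claims):.2f}")
```

Swapping `toy_entailment` for a function that wraps a real NLI model leaves the rest of the sketch unchanged. The per-claim triples show which source sentence each claim was checked against, and that is the kind of interpretability the abstract refers to.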
