CL · Oct 24, 2020

GO FIGURE: A Meta Evaluation of Factuality in Summarization

arXiv:2010.12834v2 · 742 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for robust evaluation of factuality metrics in summarization, which matters for making generated text more reliable. The advance is incremental because it builds on existing metrics.

The authors introduce GO FIGURE, a meta-evaluation framework that tests factuality metrics against five conditions on diagnostic data across three summarization tasks. Their analysis of ten metrics finds that QA metrics generally improve over standard metrics, but that their performance depends heavily on how the questions are generated.

While neural language models can generate text with remarkable fluency and coherence, controlling for factual correctness in generation remains an open research question. This major discrepancy between the surface-level fluency and the content-level correctness of neural generation has motivated a new line of research that seeks automatic metrics for evaluating the factuality of machine text. In this paper, we introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics. We propose five necessary and intuitive conditions to evaluate factuality metrics on diagnostic factuality data across three different summarization tasks. Our benchmark analysis on ten factuality metrics reveals that our meta-evaluation framework provides a robust and efficient evaluation that is extensible to multiple types of factual consistency and standard generation metrics, including QA metrics. It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
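To make the idea of meta-evaluating a metric on diagnostic data concrete, here is a minimal sketch. The abstract does not spell out the five conditions, so the check below is only an assumed example of the kind of property one might test: a factuality metric's score should not rise as more factual errors are injected into a summary. The names `token_overlap` and `is_monotone_under_errors`, the toy metric, and the example sentences are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def token_overlap(source: str, summary: str) -> float:
    """Toy stand-in metric: fraction of summary tokens found in the source."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(token in source_tokens for token in summary_tokens)
    return hits / len(summary_tokens)


def is_monotone_under_errors(
    metric: Callable[[str, str], float],
    source: str,
    summaries_by_error_level: Sequence[str],
) -> bool:
    """Assumed condition (not confirmed by the abstract): scores never
    increase as the number of injected factual errors grows.

    `summaries_by_error_level` must be ordered from fewest to most errors.
    """
    scores = [metric(source, s) for s in summaries_by_error_level]
    return all(a >= b for a, b in zip(scores, scores[1:]))


# Hypothetical diagnostic data: one source, summaries with 0, 1, and 2 errors.
source = "the company reported a profit of ten million dollars in march"
diagnostic = [
    "the company reported a profit in march",  # no injected errors
    "the company reported a loss in march",    # one injected error
    "the company reported a loss in april",    # two injected errors
]

scores = [token_overlap(source, s) for s in diagnostic]
passes = is_monotone_under_errors(token_overlap, source, diagnostic)
print(scores, passes)
```

A real meta-evaluation would swap `token_overlap` for the metric under study and use many source documents with controlled error injection. This sketch only shows the shape of one such check.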

