Assessing The Factual Accuracy of Generated Text
This work addresses the need for better factual assessment in text generation, particularly for summarization, but the contribution is incremental because it builds on existing relation extraction methods.
The paper tackles the problem of evaluating factual accuracy in generated text by proposing a model-based metric, and uses a human evaluation study on a Wikipedia summarization task to show that it outperforms traditional metrics such as ROUGE.
We propose a model-based metric to estimate the factual accuracy of generated text that is complementary to typical scoring schemes like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy). We introduce and release a new large-scale dataset based on Wikipedia and Wikidata to train relation classifiers and end-to-end fact extraction models. The end-to-end models are shown to be able to extract complete sets of facts from datasets with full pages of text. We then analyse multiple models that estimate factual accuracy on a Wikipedia text summarization task, and show their efficacy compared to ROUGE and other model-free variants by conducting a human evaluation study.
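To make the idea of a fact-based metric concrete, the following is a minimal sketch of one plausible way to score generated text once facts have been extracted. It assumes facts are represented as (subject, relation, object) tuples and that an extractor has already produced them for both the reference and the generated text. The function name `fact_precision`, the tuple layout, and the rule of scoring only claims whose subject and relation also appear in the reference are illustrative assumptions, not the definition given in the paper.

```python
from typing import Iterable, Optional, Tuple

# Hypothetical fact representation: (subject, relation, object).
Fact = Tuple[str, str, str]


def fact_precision(
    reference_facts: Iterable[Fact],
    generated_facts: Iterable[Fact],
) -> Optional[float]:
    """Fraction of checkable generated facts that match the reference.

    A generated fact is treated as checkable only if the reference contains
    some fact with the same subject and relation (an assumption of this
    sketch). Returns None when nothing can be checked.
    """
    reference = set(reference_facts)
    generated = set(generated_facts)

    # (subject, relation) pairs that the reference can verify or refute.
    checkable_keys = {(s, r) for s, r, _ in reference}
    checkable = {f for f in generated if (f[0], f[1]) in checkable_keys}

    if not checkable:
        return None
    return len(checkable & reference) / len(checkable)


# Toy, hand-written facts for illustration only.
reference = [
    ("Ada Lovelace", "born in", "London"),
    ("Ada Lovelace", "occupation", "mathematician"),
]
generated = [
    ("Ada Lovelace", "born in", "Paris"),              # refuted by the reference
    ("Ada Lovelace", "occupation", "mathematician"),   # supported
    ("Ada Lovelace", "spouse", "William King"),        # not checkable here
]

score = fact_precision(reference, generated)
print(score)
```

Unlike an n-gram overlap score, this kind of measure penalizes a summary that reuses the reference wording but swaps a single entity, which is the complementarity with ROUGE and BLEU that the abstract describes.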