FFCI: A Framework for Interpretable Automatic Evaluation of Summarization
This work addresses comprehensive and interpretable automatic evaluation of summarization models, a problem that matters to researchers and developers in natural language processing.
This paper introduces FFCI, a framework for fine-grained summarization evaluation along four dimensions: faithfulness, focus, coverage, and inter-sentential coherence. The authors develop automatic evaluation methods for each dimension, drawing on question answering, semantic textual similarity, next-sentence prediction, and scores from pre-trained language models, and apply them to a range of summarization models across two datasets, with some unexpected results.
In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.
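To make the precision/recall reading of focus and coverage concrete, the following is a minimal sketch only. It assumes plain whitespace tokenization and exact token overlap as a stand-in for the model-based metrics the abstract describes; the function name `focus_and_coverage` is hypothetical and is not taken from the paper.

```python
from collections import Counter
from typing import Tuple


def focus_and_coverage(summary: str, reference: str) -> Tuple[float, float]:
    """Toy token-overlap stand-in (an assumption, not the paper's metric).

    focus    = overlap / number of summary tokens   (precision-like)
    coverage = overlap / number of reference tokens (recall-like)
    """
    sum_tokens = Counter(summary.lower().split())
    ref_tokens = Counter(reference.lower().split())
    # Multiset intersection: each token counts at most as often as it
    # appears in both the summary and the reference.
    overlap = sum((sum_tokens & ref_tokens).values())
    # max(..., 1) avoids division by zero for empty inputs.
    focus = overlap / max(sum(sum_tokens.values()), 1)
    coverage = overlap / max(sum(ref_tokens.values()), 1)
    return focus, coverage


focus, coverage = focus_and_coverage("the cat sat", "the cat sat on the mat")
print(f"focus={focus:.2f} coverage={coverage:.2f}")  # focus=1.00 coverage=0.50
```

In this toy case every summary token appears in the reference, so focus is 1.0, while only half of the reference tokens are recovered, so coverage is 0.5. This illustrates why the two dimensions are reported separately: a short summary can be fully on-topic yet still omit much of the reference content.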