CL, LG | May 23, 2023

USB: A Unified Summarization Benchmark Across Tasks and Domains

arXiv:2305.14296v2 | 133 citations
Originality: Synthesis-oriented
AI Analysis

This paper addresses the lack of a comprehensive benchmark for controlled and reliable summarization in NLP, though the contribution is incremental in that it builds on existing summarization work.

The authors introduce a unified summarization benchmark with rich annotations spanning 8 tasks and 6 domains. They find that fine-tuned models outperform larger few-shot prompted models on multiple tasks, and that human-labeled data is more effective than heuristically generated training data for factuality tasks.

While the NLP community has produced numerous summarization benchmarks, none provide the rich annotations required to simultaneously address many important problems related to control and reliability. We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks: (i) extractive summarization; (ii) abstractive summarization; (iii) topic-based summarization; (iv) compressing selected sentences into a one-line summary; (v) surfacing evidence for a summary sentence; (vi) predicting the factual accuracy of a summary sentence; (vii) identifying unsubstantiated spans in a summary sentence; (viii) correcting factual errors in summaries. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics to create training data and find that training on them results in worse performance than training on $20\times$ less human-labeled data. Our articles draw from $6$ domains, facilitating cross-domain analysis. On some tasks, the amount of training data matters more than the domain where it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial.

Code Implementations: 1 repo