CL, Feb 26, 2024

Benchmarking LLMs on the Semantic Overlap Summarization Task

arXiv:2402.17008v2, 2 citations, h-index: 6
Originality: Synthesis-oriented
AI Analysis

This work is a benchmarking study of LLMs on a constrained summarization task. Its contribution is incremental: it applies existing methods to new data.

The authors benchmark popular Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task, introduce a new dataset (PrivacyPolicyPairs), and evaluate over 900,000 generated summaries across two SOS datasets from different domains, with human evaluation on a subset of 540 samples.

Semantic Overlap Summarization (SOS) is a constrained multi-document summarization task, where the constraint is to capture the common/overlapping information between two alternative narratives. In this work, we perform a benchmarking study of popular Large Language Models (LLMs) exclusively on the SOS task. Additionally, we introduce the PrivacyPolicyPairs (3P) dataset to expand the space of SOS benchmarks in terms of quantity and variety. This dataset provides 135 high-quality SOS data samples sourced from privacy policy documents. We then use a standard prompting taxonomy called TELeR to create and evaluate 905,216 distinct LLM-generated summaries over two SOS datasets from different domains, and we further conduct human evaluation on a subset of 540 samples. We conclude the paper by analyzing models' performances and the reliability of automatic evaluation. The code and datasets used to conduct this study are available at https://anonymous.4open.science/r/llm_eval-E16D.

