CL, Feb 26, 2024

Benchmarking LLMs on the Semantic Overlap Summarization Task

John Salvador, Naman Bansal, Mousumi Akter, Souvika Sarkar, Anupam Das, Shubhra Kanti Karmaker

arXiv:2402.17008v2, 2.72 citations, h-index: 6

Originality: Synthesis-oriented

AI Analysis

This work provides a benchmarking study for LLMs on a constrained summarization task, which is incremental as it applies existing methods to new data.

The authors benchmarked popular Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task, introducing a new dataset and evaluating over 900,000 summaries across domains, with human evaluation on 540 samples.

Semantic Overlap Summarization (SOS) is a constrained multi-document summarization task, where the constraint is to capture the common/overlapping information between two alternative narratives. In this work, we perform a benchmarking study of popular Large Language Models (LLMs) exclusively on the SOS task. Additionally, we introduce the PrivacyPolicyPairs (3P) dataset to expand the space of SOS benchmarks in terms of quantity and variety. This dataset provides 135 high-quality SOS data samples sourced from privacy policy documents. We then use a standard prompting taxonomy called TELeR to create and evaluate 905,216 distinct LLM-generated summaries over two SOS datasets from different domains, and we further conduct human evaluation on a subset of 540 samples. We conclude the paper by analyzing models' performances and the reliability of automatic evaluation. The code and datasets used to conduct this study are available at https://anonymous.4open.science/r/llm_eval-E16D.
