CL May 22, 2023

LMGQS: A Large-scale Dataset for Query-focused Summarization

arXiv:2305.13086v1, 136 citations
Originality: Incremental advance
AI Analysis

This paper provides a large-scale dataset for natural language processing researchers working on query-focused summarization (QFS), addressing a data bottleneck in the field.

The authors address the lack of large-scale datasets for query-focused summarization by converting four generic summarization benchmarks into LMGQS, a dataset of over 1 million document-query-summary samples. Fine-tuning a language model on LMGQS yields state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks.

Query-focused summarization (QFS) aims to extract or generate a summary of an input document that directly answers or is relevant to a given query. The lack of large-scale datasets in the form of documents, queries, and summaries has hindered model development in this area. In contrast, multiple large-scale high-quality datasets for generic summarization exist. We hypothesize that there is a hidden query for each summary sentence in a generic summarization annotation, and we utilize a large-scale pretrained language model to recover it. In this way, we convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS, which consists of over 1 million document-query-summary samples. We thoroughly investigate the properties of our proposed dataset and establish baselines with state-of-the-art summarization models. By fine-tuning a language model on LMGQS, we achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks, demonstrating the high quality and diversity of LMGQS.
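
The conversion idea in the abstract (one hidden query per summary sentence, recovered by a language model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the names `QFSSample`, `split_sentences`, `convert_to_qfs`, and `stub_generate_query` are hypothetical, the sentence splitter is deliberately naive, and the query generator is a placeholder standing in for the pretrained language model, whose prompts and settings are not given in the abstract.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QFSSample:
    """One document-query-summary triple (hypothetical container)."""
    document: str
    query: str
    summary: str


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    # An assumption for illustration; a real pipeline would use a proper tokenizer.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def convert_to_qfs(
    document: str,
    summary: str,
    generate_query: Callable[[str, str], str],
) -> List[QFSSample]:
    """Turn one generic (document, summary) pair into QFS samples.

    For each summary sentence, ask `generate_query` for the query that the
    sentence would answer, then emit a (document, query, sentence) triple.
    """
    samples = []
    for sentence in split_sentences(summary):
        query = generate_query(document, sentence)
        samples.append(QFSSample(document=document, query=query, summary=sentence))
    return samples


def stub_generate_query(document: str, summary_sentence: str) -> str:
    # Placeholder for the language-model call (assumption): a real system
    # would prompt a pretrained model with the document and the sentence.
    return f"What question does this answer: {summary_sentence}"


document = (
    "The city council met on Monday. It approved a new budget. "
    "Critics say the plan ignores housing."
)
summary = "The council approved a budget. Critics raised housing concerns."
samples = convert_to_qfs(document, summary, stub_generate_query)
for s in samples:
    print(s.query, "->", s.summary)
```

Passing the query generator in as a callable keeps the sketch independent of any particular model API: swapping the stub for a real model call would not change the conversion loop.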

