CL Nov 8, 2019

Transforming Wikipedia into Augmented Data for Query-Focused Summarization

Haichao Zhu, Li Dong, Furu Wei, Bing Qin, Ting Liu

arXiv:1911.03324v22.725 citations h-index: 81

Originality: Incremental advance

AI Analysis

This paper addresses the data scarcity problem in query-focused summarization for researchers and practitioners, though the advance is incremental because it builds on existing methods such as BERT.

The authors tackled the limited size of query-focused summarization datasets by automatically collecting a large dataset (WIKIREF) of over 280,000 examples from Wikipedia for data augmentation, and their model achieved improved performance on DUC benchmarks after fine-tuning.

The limited size of existing query-focused summarization datasets renders training data-driven summarization models challenging. Meanwhile, the manual construction of a query-focused summarization corpus is costly and time-consuming. In this paper, we use Wikipedia to automatically collect a large query-focused summarization dataset (named WIKIREF) of more than 280,000 examples, which can serve as a means of data augmentation. We also develop a BERT-based query-focused summarization model (Q-BERT) to extract sentences from the documents as summaries. To better adapt a huge model containing millions of parameters to tiny benchmarks, we identify and fine-tune only a sparse subnetwork, which corresponds to a small fraction of the whole model parameters. Experimental results on three DUC benchmarks show that the model pre-trained on WIKIREF has already achieved reasonable performance. After fine-tuning on the specific benchmark datasets, the model with data augmentation outperforms strong comparison systems. Moreover, both our proposed Q-BERT model and subnetwork fine-tuning further improve the model performance. The dataset is publicly available at https://aka.ms/wikiref.
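The subnetwork fine-tuning idea in the abstract can be illustrated with a toy sketch. The abstract does not say how the sparse subnetwork is identified, so the selection rule below (keep the parameters with the largest gradient magnitude) is an assumption made purely for illustration, and the names `select_subnetwork` and `masked_sgd_step` are hypothetical. The point is only that parameters outside the chosen mask stay frozen during the update.

```python
def select_subnetwork(grads, fraction):
    """Return a 0/1 mask over parameters.

    Assumption for illustration: keep the `fraction` of parameters whose
    gradients have the largest absolute value. The paper's actual
    selection criterion is not stated in the abstract.
    """
    k = max(1, int(len(grads) * fraction))
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(grads))]


def masked_sgd_step(params, grads, mask, lr):
    """Apply one SGD step, updating only the parameters where mask == 1."""
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


# Toy values standing in for model parameters and their gradients.
params = [0.5, -1.0, 2.0, 0.1]
grads = [0.01, -0.9, 0.05, 0.4]

mask = select_subnetwork(grads, fraction=0.5)
updated = masked_sgd_step(params, grads, mask, lr=0.1)
```

Here only the two parameters with the largest gradient magnitude are updated; the other two keep their pre-trained values, which limits how far a large model can drift when tuned on a tiny benchmark.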

