IR · AI · CL · Nov 15, 2020

Open4Business(O4B): An Open Access Dataset for Summarizing Business Documents

arXiv:2011.07636v3 · 1 citation
AI Analysis

This paper addresses data scarcity in business document summarization, a problem for both researchers and practitioners. Its contribution is incremental: it focuses on dataset creation rather than novel methods.

The authors tackle the lack of large-scale, domain-specific datasets for automatic summarization by introducing Open4Business (O4B), a dataset of 17,458 open access business articles with reference summaries. They show that models trained on O4B perform comparably to models trained on a non-open access dataset that is 7x larger.

A major challenge in fine-tuning deep learning models for automatic summarization is the need for large domain-specific datasets. One of the barriers to curating such data from resources like online publications is navigating the license regulations applicable to their re-use, especially for commercial purposes. As a result, despite the availability of several business journals, there are no large-scale datasets for summarizing business documents. In this work, we introduce Open4Business (O4B), a dataset of 17,458 open access business articles and their reference summaries. The dataset introduces a new challenge for summarization in the business domain, requiring highly abstractive and more concise summaries as compared to other existing datasets. Additionally, we evaluate existing models on it and show that models trained on O4B and on a 7x larger non-open access dataset achieve comparable performance on summarization. We release the dataset, along with the code, which can be leveraged to similarly gather data for multiple domains.
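The abstract describes O4B summaries as "highly abstractive" and "more concise" without saying here how these properties are measured. As a rough illustration only, the sketch below computes two common proxies: the fraction of summary n-grams that do not appear in the article, and the article-to-summary length ratio. The function names, the whitespace tokenization, and the choice of bigrams are assumptions made for this example, not the authors' method or code.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in a token list (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article: str, summary: str, n: int = 2) -> float:
    """Fraction of distinct summary n-grams that never occur in the article.

    Higher values suggest a more abstractive summary. Tokenization is a
    plain lowercase whitespace split, which is an assumption of this sketch.
    """
    article_ngrams = ngrams(article.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - article_ngrams) / len(summary_ngrams)


def compression_ratio(article: str, summary: str) -> float:
    """Article length divided by summary length, in whitespace tokens."""
    summary_len = len(summary.split())
    if summary_len == 0:
        return float("inf")
    return len(article.split()) / summary_len


# Toy inputs, invented for illustration (not taken from O4B).
article = "the firm raised prices and profits grew"
summary = "profits grew after price increases"

print(novel_ngram_ratio(article, summary, n=2))
print(compression_ratio(article, summary))
```

Averaging these two numbers over a dataset gives a quick way to compare how abstractive and how compressed its reference summaries are relative to another corpus, under the tokenization assumptions above.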

Code Implementations: 1 repo