CL · DL · Apr 19, 2021

NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)

arXiv:2104.09647v2 · 4 citations
AI Analysis

This dataset addresses a gap for researchers in linguistics and the social sciences by providing what the authors describe as the first large-scale, multilingual corpus of news revision histories. Its contribution is incremental in that it focuses on data collection rather than novel methods.

The authors tackled the lack of publicly available news article revision histories by presenting NewsEdits, a dataset containing 1,278,804 articles with 4,609,430 versions, including 10.9 million added, 8.9 million changed, and 6.8 million removed sentences, and 72 million atomic edits.

News article revision histories have the potential to give us novel insights across varied fields of linguistics and social sciences. In this work, we present, to our knowledge, the first publicly available dataset of news article revision histories, or NewsEdits. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences, and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories in any domain.

