CL · GN · Jun 30, 2023

A Massive Scale Semantic Similarity Dataset of Historical English

arXiv:2306.17810v2 · 5 citations · h-index: 8
Originality: Synthesis-oriented
AI Analysis

This dataset enables applications such as studying semantic change across space and time. The contribution is incremental, however: it extends existing semantic similarity resources to a historical domain rather than introducing a new kind of resource.

The authors tackled the lack of large-scale historical semantic similarity datasets by creating HEADLINES, a dataset with nearly 400 million positive pairs from U.S. local newspapers spanning 1920 to 1989, using digitized articles and headlines to capture semantic similarity over time.

A diversity of tasks use language models trained on semantic similarity data. While there are a variety of datasets that capture semantic similarity, they are either constructed from modern web data or are relatively small datasets created in the past decade by human annotators. This study utilizes a novel source, newly digitized articles from off-copyright, local U.S. newspapers, to assemble a massive-scale semantic similarity dataset spanning 70 years from 1920 to 1989 and containing nearly 400M positive semantic similarity pairs. Historically, around half of articles in U.S. local newspapers came from newswires like the Associated Press. While local papers reproduced articles from the newswire, they wrote their own headlines, which form abstractive summaries of the associated articles. We associate articles and their headlines by exploiting document layouts and language understanding. We then use deep neural methods to detect which articles are from the same underlying source, in the presence of substantial noise and abridgement. The headlines of reproduced articles form positive semantic similarity pairs. The resulting publicly available HEADLINES dataset is significantly larger than most existing semantic similarity datasets and covers a much longer span of time. It will facilitate the application of contrastively trained semantic similarity models to a variety of tasks, including the study of semantic change across space and time.
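The pairing step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the clustering of reproduced articles has already been done, and the `positive_pairs` function, the cluster IDs, and the sample headlines are all hypothetical. Dropping duplicate headlines within a cluster is likewise an assumption made here for simplicity.

```python
from itertools import combinations


def positive_pairs(clusters):
    """Yield every unordered pair of distinct headlines within each cluster.

    `clusters` maps a (hypothetical) cluster ID to the headlines that
    different papers wrote for the same underlying wire article.
    """
    for headlines in clusters.values():
        # Assumption: identical headlines are collapsed so that no pair
        # consists of a headline matched with itself.
        unique = sorted(set(headlines))
        yield from combinations(unique, 2)


# Invented sample data, for illustration only.
example_clusters = {
    "cluster_a": [
        "Senate Passes Farm Bill",
        "Farm Measure Clears Senate",
        "Senate Approves Farm Aid",
    ],
    # A single headline has nothing to pair with.
    "cluster_b": ["Storm Hits Coast"],
}

pairs = list(positive_pairs(example_clusters))
for left, right in pairs:
    print(f"{left!r} <-> {right!r}")
```

A cluster of n distinct headlines yields n(n-1)/2 pairs, so widely reprinted wire articles contribute disproportionately many pairs, which is one way a corpus of this kind can reach hundreds of millions of pairs.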

