CL · Feb 1, 2021

Counting Protests in News Articles: A Dataset and Semi-Automated Data Collection Pipeline

arXiv:2102.00917v1
Originality: Synthesis-oriented
AI Analysis

This work addresses the challenge of event extraction for civic decision-making by giving NLP researchers a dataset and a data collection pipeline. The contribution is incremental, since it builds on existing methods for domain detection and slot filling.

The paper tackles the problem of extracting structured data on protests from news articles. It releases a manually labeled dataset of 42,347 protest events reported in the U.S. between January 2017 and January 2021, describes the semi-automated pipeline used to collect it, and benchmarks an LSTM classifier to demonstrate the utility of processing articles by paragraph and sentence when counting events.

Between January 2017 and January 2021, thousands of local news sources in the United States reported on over 42,000 protests about topics such as civil rights, immigration, guns, and the environment. Given the vast number of local journalists who report on protests daily, extracting these events as structured data to understand temporal and geographic trends can empower civic decision-making. However, the task of extracting events from news articles presents well-known challenges to the NLP community in the fields of domain detection, slot filling, and coreference resolution. To help improve the resources available for extracting structured data from news stories, our contribution is three-fold. We 1) release a manually labeled dataset of news article URLs, dates, locations, crowd size estimates, and 494 discrete descriptive tags corresponding to 42,347 reported protest events in the United States between January 2017 and January 2021; 2) describe the semi-automated data collection pipeline used to discover, sort, and review the 144,568 English articles that comprise the dataset; and 3) benchmark a long short-term memory (LSTM) low-dimensional classifier that demonstrates the utility of processing news articles based on syntactic structures, such as paragraphs and sentences, to count the number of reported protest events.
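
As a rough illustration of how the structured records described above could support temporal and geographic analysis, the sketch below models one protest event and tallies events by month and by state. The class name `ProtestEvent`, its field names, and the three sample records are all hypothetical placeholders chosen for this example; they are not taken from the released dataset, whose actual schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ProtestEvent:
    # Hypothetical field names; the released dataset's real columns may differ.
    url: str                          # news article URL
    event_date: date                  # date of the reported protest
    state: str                        # location, reduced here to a U.S. state code
    crowd_size: Optional[int] = None  # estimate; None when no estimate was reported
    tags: Tuple[str, ...] = ()        # descriptive tags


def count_by_month(events: Iterable[ProtestEvent]) -> Counter:
    """Tally events per calendar month, keyed as 'YYYY-MM'."""
    return Counter(e.event_date.strftime("%Y-%m") for e in events)


def count_by_state(events: Iterable[ProtestEvent]) -> Counter:
    """Tally events per state code."""
    return Counter(e.state for e in events)


# Placeholder records for illustration only (not real dataset rows).
events = [
    ProtestEvent("https://example.com/a", date(2017, 1, 21), "NY", 500, ("civil rights",)),
    ProtestEvent("https://example.com/b", date(2017, 1, 29), "CA", None, ("immigration",)),
    ProtestEvent("https://example.com/c", date(2018, 3, 24), "NY", 1200, ("guns",)),
]

monthly = count_by_month(events)
by_state = count_by_state(events)
```

Keeping `crowd_size` optional reflects the assumption that not every article reports an estimate; any aggregate over crowd sizes would need to decide how to treat those missing values.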

Code Implementations: 1 repo
