Modeling "Newsworthiness" for Lead-Generation Across Corpora
This work addresses the challenge journalists face in efficiently finding story ideas in vast collections of government records. The contribution is incremental, since it applies an existing method to a new domain.
The paper tackles the problem of identifying newsworthy documents in large government corpora for journalists. A fine-tuned RoBERTa model achieves 0.93 AUC on held-out labeled documents and 0.88 AUC on expert-validated unlabeled corpora.
Journalists obtain "leads", or story ideas, by reading large corpora of government records: court cases, proposed bills, etc. However, only a small percentage of such records are interesting documents. We propose a model of "newsworthiness" aimed at surfacing interesting documents. We train models on automatically labeled corpora -- published newspaper articles -- to predict whether each article was a front-page article (i.e., \textbf{newsworthy}) or not (i.e., \textbf{less newsworthy}). We transfer these models to unlabeled corpora -- court cases, bills, city-council meeting minutes -- to rank documents in these corpora on "newsworthiness". A fine-tuned RoBERTa model achieves .93 AUC on held-out labeled documents, and .88 AUC on expert-validated unlabeled corpora. We provide interpretation and visualization for our models.
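To make the evaluation and ranking steps concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `auc` and `rank_by_score`, the document identifiers, and all scores are hypothetical and invented for illustration. The sketch assumes a classifier has already produced one "newsworthiness" score per document. It computes AUC as the fraction of (newsworthy, less newsworthy) pairs in which the newsworthy document receives the higher score, and it ranks an unlabeled corpus by score.

```python
from typing import List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Label 1 = newsworthy, 0 = less newsworthy.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_by_score(doc_ids: Sequence[str], scores: Sequence[float]) -> List[str]:
    """Return document ids ordered from highest to lowest score."""
    order = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
    return [doc_ids[i] for i in order]


# Hypothetical held-out labels and model scores (toy values, not from the paper).
heldout_labels = [1, 0, 1, 0, 0]
heldout_scores = [0.9, 0.2, 0.4, 0.6, 0.1]
heldout_auc = auc(heldout_labels, heldout_scores)  # 5 of 6 pairs ordered correctly

# Hypothetical unlabeled government records scored by the same model.
unlabeled_ids = ["court_case_a", "bill_b", "council_minutes_c"]
unlabeled_scores = [0.3, 0.8, 0.5]
ranking = rank_by_score(unlabeled_ids, unlabeled_scores)
```

The pairwise definition is quadratic in corpus size and is used here only because it makes the meaning of AUC explicit; a rank-based computation gives the same value more efficiently on large corpora.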