Neural Total Variation Distance Estimators for Changepoint Detection in News Data
The paper addresses changepoint detection in high-dimensional, noisy news data, with applications in journalism, policy analysis, and crisis monitoring. Its contribution is incremental, since it adapts an existing method to a new domain.
The paper detects shifts in public discourse in news data by using neural networks to estimate the total variation distance between content distributions. It identifies major events such as 9/11 and COVID-19 with minimal domain knowledge.
Detecting when public discourse shifts in response to major events is crucial for understanding societal dynamics. Real-world data is high-dimensional, sparse, and noisy, making changepoint detection in this domain a challenging endeavor. In this paper, we leverage neural networks for changepoint detection in news data, introducing a method based on the so-called learning-by-confusion scheme, which was originally developed for detecting phase transitions in physical systems. We train classifiers to distinguish between articles from different time periods. The resulting classification accuracy is used to estimate the total variation distance between underlying content distributions, where significant distances highlight changepoints. We demonstrate the effectiveness of this method on both synthetic datasets and real-world data from The Guardian newspaper, successfully identifying major historical events including 9/11, the COVID-19 pandemic, and presidential elections. Our approach requires minimal domain knowledge, can autonomously discover significant shifts in public discourse, and yields a quantitative measure of change in content, making it valuable for journalism, policy analysis, and crisis monitoring.
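The link between classification accuracy and total variation distance can be illustrated with a small sketch. For two equally weighted classes, the best achievable balanced accuracy is (1 + TV) / 2, so any classifier's held-out balanced accuracy `acc` gives the estimate TV ≈ 2 * acc - 1, which is a lower bound up to sampling noise. The code below is not the paper's implementation: the nearest-centroid classifier stands in for a neural network, the one-dimensional Gaussian series stands in for article features, and the names `tv_from_accuracy`, `centroid_accuracy`, `scan`, and the `window` parameter are hypothetical choices made for illustration.

```python
import random


def tv_from_accuracy(acc):
    """Estimate TV distance from balanced binary accuracy: TV ~ 2*acc - 1, clipped at 0."""
    return max(0.0, 2.0 * acc - 1.0)


def centroid_accuracy(left, right, rng):
    """Fit a nearest-centroid classifier (assumed stand-in for a neural network)
    on half of each segment and return balanced accuracy on the other half."""
    left, right = left[:], right[:]
    rng.shuffle(left)
    rng.shuffle(right)
    half_l, half_r = len(left) // 2, len(right) // 2
    mu_l = sum(left[:half_l]) / half_l      # centroid of the "before" class
    mu_r = sum(right[:half_r]) / half_r     # centroid of the "after" class
    test_l, test_r = left[half_l:], right[half_r:]
    acc_l = sum(abs(x - mu_l) <= abs(x - mu_r) for x in test_l) / len(test_l)
    acc_r = sum(abs(x - mu_r) < abs(x - mu_l) for x in test_r) / len(test_r)
    return 0.5 * (acc_l + acc_r)


def scan(series, window, seed=0):
    """For each candidate time t, compare the `window` samples before t with the
    `window` samples after t and record the estimated TV distance."""
    rng = random.Random(seed)
    scores = {}
    for t in range(window, len(series) - window + 1):
        acc = centroid_accuracy(series[t - window:t], series[t:t + window], rng)
        scores[t] = tv_from_accuracy(acc)
    return scores


# Synthetic data (an assumption for this sketch): the mean shifts from 0 to 3 at t = 300.
data_rng = random.Random(42)
series = [data_rng.gauss(0.0, 1.0) for _ in range(300)]
series += [data_rng.gauss(3.0, 1.0) for _ in range(300)]

scores = scan(series, window=100)
best_t = max(scores, key=scores.get)  # candidate changepoint: largest estimated distance
print(best_t, round(scores[best_t], 2))
```

In this toy setting the estimated distance should stay near zero where both windows come from the same distribution and peak near the true shift. With real articles, the scalar samples would be replaced by text features and the centroid rule by a trained classifier, but the accuracy-to-distance conversion stays the same.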