From Noise to Signal: When Outliers Seed New Topics
This work addresses early topic detection in news analysis. Its contribution is incremental, since it builds on existing topic modeling and outlier detection methods.
The paper tackles outliers in dynamic topic modeling by showing that some of them signal emerging topics. It introduces a temporal taxonomy that classifies document trajectories and evaluates it on a French news corpus, where it finds a small subset of high-consensus anticipatory outliers.
Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.
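To make the trajectory idea concrete, the sketch below labels a document from its cluster assignments across cumulative windows and then measures how many models agree on the label. It is a simplified illustration, not the authors' implementation. The three-way rule, the function names, the label strings, and the use of -1 as the outlier marker are all assumptions introduced here, and the actual taxonomy may draw finer distinctions.

```python
from collections import Counter

OUTLIER = -1  # assumed convention: -1 means "not assigned to any cluster"


def classify_trajectory(labels):
    """Label one document from its cluster ids over cumulative windows.

    `labels` holds the document's cluster id in each window, starting with
    the window in which it first appears. Hypothetical, simplified rule.
    """
    if all(label == OUTLIER for label in labels):
        return "isolated"      # never absorbed by any topic
    if labels[0] == OUTLIER:
        return "anticipatory"  # outlier at first, joins a topic later
    return "reinforcing"       # assigned to a topic from its first window


def consensus(per_model_labels, target="anticipatory"):
    """Fraction of models that give the same document the `target` label."""
    counts = Counter(per_model_labels)
    return counts[target] / len(per_model_labels)


# One document seen in three windows: unclustered twice, then joins topic 3.
print(classify_trajectory([OUTLIER, OUTLIER, 3]))

# Labels for one document from four hypothetical embedding models.
print(consensus(["anticipatory", "anticipatory", "isolated", "reinforcing"]))
```

Under this reading, a high-consensus anticipatory outlier is a document whose consensus score stays close to 1 across the embedding models, which is why agreement between models raises confidence in the label.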