From Outliers to Topics in Language Models: Anticipating Trends in News Corpora
This work addresses the challenge of anticipating trends in dynamic news data, a problem relevant to researchers and analysts. Its contribution is incremental: it builds on existing topic modeling and outlier detection methods.
The paper tackles the problem of identifying emerging topics in news corpora by treating outliers as weak signals. Using vector embeddings and cumulative clustering on French and English datasets, it finds that outliers tend to evolve into coherent topics over time.
This paper examines how outliers, often dismissed as noise in topic modeling, can act as weak signals of emerging topics in dynamic news corpora. Using vector embeddings from state-of-the-art language models and a cumulative clustering approach, we track their evolution over time in French and English news datasets focused on corporate social responsibility and climate change. The results reveal a consistent pattern: outliers tend to evolve into coherent topics over time across both models and languages.
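The idea of tracking outliers through cumulative clustering can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name the clustering algorithm or its parameters, so a small hand-written DBSCAN stands in for it, toy 2D points stand in for language-model embeddings, and the names `dbscan`, `track_outliers`, and `absorbed_outliers` are hypothetical. After each time window the whole corpus so far is re-clustered, and a document counts as an absorbed outlier if it was first labeled as noise and later falls inside a cluster.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN (illustrative stand-in): one label per point, -1 = outlier."""
    n = len(points)
    # Neighborhoods include the point itself, as in standard DBSCAN.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if len(neighbors[p]) < min_pts:
                continue  # border point: joins the cluster but does not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

def track_outliers(windows, eps, min_pts):
    """Re-cluster the cumulative corpus after each window.

    Returns {document index: [label at each step since the document arrived]}.
    """
    corpus, history = [], {}
    for window in windows:
        corpus.extend(window)
        for idx, label in enumerate(dbscan(corpus, eps, min_pts)):
            history.setdefault(idx, []).append(label)
    return history

def absorbed_outliers(history):
    """Documents first labeled as outliers that end up inside a cluster."""
    return [i for i, seq in history.items() if seq[0] == -1 and seq[-1] != -1]

# Toy "embeddings": an established topic near (0, 0) and a weak signal near (5, 5).
windows = [
    [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)],  # one isolated document
    [(5.1, 5.0), (0.1, 0.1)],                          # a second similar document
    [(5.0, 5.1), (9.0, 0.0)],                          # a third: enough for a cluster
]
history = track_outliers(windows, eps=0.5, min_pts=3)
print(history[3])                  # → [-1, -1, 1]
print(absorbed_outliers(history))  # → [3, 4]
```

In this toy run the document at (5.0, 5.0) is an outlier for two windows and becomes part of a new cluster in the third, while the late arrival at (9.0, 0.0) is still an outlier when the run ends. With real data the points would be high-dimensional embeddings and the distance and density thresholds would need tuning.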