CLSI May 20, 2022

Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts

arXiv:2205.10408v1628 citations h-index: 47
Originality: Incremental advance
AI Analysis

This work addresses infectious disease modelling in areas with unreliable epidemiological data, though it is incremental as it applies existing transformer methods to a new domain.

The authors forecast COVID-19 caseloads using features extracted by unsupervised clustering of social media posts. These features outperformed other feature types at predicting upward trend signals and were then used as covariates in a transformer-based time-series forecasting model.

We present a novel approach incorporating transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts within specific US states' COVID-19 subreddits. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task we fully utilise the predictive power of the caseload and compare the relative strengths of using different supplementary datasets as covariate feature sets in a transformer-based time-series model.
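
As a rough illustration of the two ideas in the abstract, the sketch below turns per-post cluster assignments into daily count features and derives binary "upward trend" labels from a caseload series. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the convention that cluster id -1 marks an unclustered post, and the relative-growth definition of an upward trend (with `horizon` and `threshold` parameters) are all hypothetical choices made here. The embedding and clustering steps themselves are assumed to have happened upstream.

```python
from collections import Counter


def cluster_count_features(posts, n_clusters):
    """Count posts per cluster per day.

    posts: iterable of (day_index, cluster_id) pairs. A cluster_id of -1
    is assumed (hypothetically) to mean "noise / no dense cluster" and is
    not counted, although the day itself is still recorded.
    Returns {day_index: [count for cluster 0, count for cluster 1, ...]}.
    """
    by_day = {}
    for day, cluster in posts:
        counts = by_day.setdefault(day, Counter())
        if cluster >= 0:
            counts[cluster] += 1
    return {day: [by_day[day][c] for c in range(n_clusters)]
            for day in sorted(by_day)}


def upward_trend_labels(cases, horizon, threshold):
    """Label day t as 1 if cases[t + horizon] exceeds cases[t] by more
    than the relative threshold, else 0.

    This definition of an "upward trend signal" is an assumption made
    for illustration; the paper's exact criterion may differ.
    """
    return [
        1 if cases[t + horizon] > cases[t] * (1 + threshold) else 0
        for t in range(len(cases) - horizon)
    ]


# Toy data: two clusters, two days; one post on day 1 is unclustered.
posts = [(0, 0), (0, 0), (0, 1), (1, -1), (1, 1)]
features = cluster_count_features(posts, n_clusters=2)

# Toy caseload series; a rise of more than 20% one day ahead counts as upward.
labels = upward_trend_labels([100, 110, 150, 140], horizon=1, threshold=0.2)
```

In a threshold-classification setup of this kind, each day's count vector would be paired with that day's label to train a classifier, and the same count vectors could instead be passed as covariates to a forecasting model.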
