Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts
This work addresses infectious disease modelling in areas where epidemiological data is unreliable. Its contribution is incremental, since it applies existing transformer methods to a new domain.
The authors forecast COVID-19 caseloads using features extracted by unsupervised clustering of social media posts. These features outperformed other feature types at predicting upward trend signals, and were then integrated into a transformer-based time-series model for forecasting.
We present a novel approach that incorporates transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts within the COVID-19 subreddits of specific US states. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task, we fully utilise the predictive power of the caseload and compare the relative strengths of using different supplementary datasets as covariate feature sets in a transformer-based time-series model.
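The feature pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the abstract does not name the clustering algorithm, so a small pure-Python DBSCAN stands in for "high-density" clustering; hand-written 2D points stand in for the sentence embeddings a transformer language model would produce; and the function names, `eps`, `min_pts`, and the 1.2 growth threshold are hypothetical.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Toy DBSCAN (stand-in, not the paper's method): cluster id per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (includes i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nearby)
    return labels


def weekly_cluster_counts(weeks, labels, n_weeks):
    """Count posts per cluster per week; noise points (-1) are ignored."""
    n_clusters = max(labels, default=-1) + 1
    counts = [[0] * n_clusters for _ in range(n_weeks)]
    for week, label in zip(weeks, labels):
        if label >= 0:
            counts[week][label] += 1
    return counts


def upward_trend_labels(cases, threshold=1.2):
    """1 if next period's caseload exceeds threshold * current caseload, else 0."""
    return [int(nxt > threshold * cur) for cur, nxt in zip(cases, cases[1:])]


# Hypothetical toy data: 2D points in place of real sentence embeddings.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.1),
              (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (9.0, 0.0)]
weeks = [0, 0, 0, 1, 1, 1, 2, 2]  # week in which each post was made
cases = [100, 130, 140]           # made-up weekly caseloads

labels = dbscan(embeddings, eps=0.3, min_pts=3)
features = weekly_cluster_counts(weeks, labels, n_weeks=3)
targets = upward_trend_labels(cases)

print(labels)    # [0, 0, 1, 0, 1, 1, 1, -1]
print(features)  # [[2, 1], [1, 2], [0, 1]]
print(targets)   # [1, 0]
```

Each row of `features` is one week's cluster-count vector, which could serve either as input to a threshold classifier trained on `targets` or as a covariate series alongside the caseload in a forecasting model. In practice a library clustering implementation and real embeddings would replace the toy pieces.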