Uncertainty-Aware Estimation of Mis/Disinformation Prevalence on Social Media
This work addresses the need for robust prevalence estimation to inform mitigation strategies, but it is incremental as it applies existing uncertainty quantification methods to a specific domain.
The study tackled the problem of estimating mis/disinformation prevalence on social media by quantifying uncertainties from sampling, annotation, and data retrieval, using a multi-platform dataset from 2025. The results showed that keyword-based data retrieval can increase variability, leading to wider confidence intervals in prevalence estimates.
Estimating the prevalence of mis/disinformation on social media is crucial for designing mitigation strategies that limit its impact. Yet such estimates are subject to several uncertainties that are rarely quantified jointly. In this study, we present a methodological contribution that uses confidence intervals to quantify the uncertainty in mis/disinformation prevalence estimates. The analysis draws on a multi-platform, multilingual dataset annotated by professional fact-checkers. Data were collected between March and April 2025 from Facebook, Instagram, LinkedIn, TikTok, X/Twitter, and YouTube across four EU Member States (France, Poland, Slovakia, and Spain). We account for three sources of uncertainty: (i) sampling uncertainty, (ii) annotation uncertainty arising from human disagreement and misclassification, and (iii) data retrieval uncertainty induced by keyword-based data collection. First, we estimate the uncertainty arising from each source separately using confidence intervals, simulation-based methods, and bootstrapping. Then, we combine multinomial simulations of annotator behaviour with keyword and post resampling to capture the joint impact of measurement uncertainty on mis/disinformation prevalence estimates. The proposed approach highlights the importance of uncertainty-aware estimation of mis/disinformation prevalence for robust analysis. The empirical results show that the variability introduced by keyword-based data retrieval can exceed baseline variability, leading to wider confidence intervals around prevalence estimates.
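As an illustration of how the three sources of uncertainty might be combined, the following minimal sketch resamples keywords, then posts, and then redraws each annotation from a per-class probability row, reporting a percentile interval for prevalence. It is not the study's implementation: the function name `joint_prevalence_interval`, the binary label coding (1 = mis/disinformation), the `confusion` matrix layout, and the toy data are all assumptions made for this example.

```python
import random


def joint_prevalence_interval(labels, keyword_ids, confusion,
                              n_draws=1000, alpha=0.05, seed=0):
    """Percentile interval for prevalence under three resampling layers.

    Assumptions (hypothetical, for illustration only):
      labels[i]      -- annotated class of post i (0 = other, 1 = mis/disinfo)
      keyword_ids[i] -- keyword that retrieved post i
      confusion[c]   -- probabilities of the label a simulated annotator
                        would assign to a post observed with class c
    """
    rng = random.Random(seed)

    # Group post indices by the keyword that retrieved them.
    posts_by_kw = {}
    for i, kw in enumerate(keyword_ids):
        posts_by_kw.setdefault(kw, []).append(i)
    kw_list = list(posts_by_kw)
    classes = list(range(len(confusion)))

    estimates = []
    for _ in range(n_draws):
        # (iii) retrieval uncertainty: resample keywords with replacement.
        kw_draw = rng.choices(kw_list, k=len(kw_list))
        pool = [i for kw in kw_draw for i in posts_by_kw[kw]]
        # (i) sampling uncertainty: resample posts from the pooled draw.
        post_draw = rng.choices(pool, k=len(pool))
        # (ii) annotation uncertainty: redraw each label from its class row.
        positives = 0
        for i in post_draw:
            simulated = rng.choices(classes, weights=confusion[labels[i]])[0]
            positives += simulated == 1
        estimates.append(positives / len(post_draw))

    # Nearest-rank percentile interval over the simulated estimates.
    estimates.sort()
    lo = estimates[int((alpha / 2) * (n_draws - 1))]
    hi = estimates[int(round((1 - alpha / 2) * (n_draws - 1)))]
    point = sum(1 for y in labels if y == 1) / len(labels)
    return point, lo, hi


# Toy usage with made-up data: ten posts retrieved by four keywords.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
toy_keywords = ["a", "a", "b", "b", "b", "c", "c", "c", "d", "d"]
toy_confusion = [[0.9, 0.1],   # observed 0: relabelled 1 with prob. 0.1
                 [0.2, 0.8]]   # observed 1: relabelled 0 with prob. 0.2
print(joint_prevalence_interval(toy_labels, toy_keywords, toy_confusion))
```

Resampling whole keywords before resampling posts treats each keyword as a cluster, so a keyword that retrieves unusually many or few positive posts widens the interval. Setting `confusion` to the identity matrix switches off the annotation layer, which gives one way to compare the contribution of each source.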