Analysis of ELSA COVID-19 Substudy response rate using machine learning algorithms
This work addresses survey non-response prediction for national statistical organizations, but it is incremental as it applies standard ML methods to a specific dataset.
The study tackled the problem of predicting non-responses in the ELSA COVID-19 Substudy using machine learning algorithms, finding that random forest performed best in balanced accuracy, KNN in precision and test accuracy, and logistic regression in AUC.
Every year, National Statistical Organisations spend time and money collecting information through surveys. Some of these surveys include follow-up studies, and some participants usually drop out of later surveys because of factors such as death, migration, change of employment, or health. In this study, we focus on the English Longitudinal Study of Ageing (ELSA) COVID-19 Substudy, which was carried out in two waves during the COVID-19 pandemic. In this substudy, some participants from wave 1 did not participate in wave 2. Our purpose is to predict non-response using Machine Learning (ML) algorithms such as K-nearest neighbours (KNN), random forest (RF), AdaBoost, logistic regression, neural networks (NN), and support vector classifier (SVC). We find that RF outperforms the other models in terms of balanced accuracy, KNN in terms of precision and test accuracy, and logistic regression in terms of the area under the receiver operating characteristic (ROC) curve, i.e. AUC.
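Because a different model wins on each metric, it may help to see how the four metrics named above are computed and why they can disagree. The following is a minimal sketch in plain Python, not the study's code. The labels and scores are invented toy values rather than ELSA data. It assumes that label 1 means "did not respond in wave 2", and the 0.5 decision threshold and all function names are illustrative choices.

```python
# Toy illustration of the four metrics; all data here are made up.
# Assumption: label 1 = non-respondent in wave 2, label 0 = respondent.

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision(y_true, y_pred):
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    # Convention chosen here: precision is 0 when nothing is predicted positive.
    return tp / (tp + fp) if (tp + fp) else 0.0

def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Ten hypothetical participants: 3 non-respondents, 7 respondents.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# Hypothetical predicted probabilities of non-response from some model.
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

# A trivial baseline that predicts "responds" for everyone.
y_baseline = [0] * len(y_true)

print("model    accuracy:         ", round(accuracy(y_true, y_pred), 3))
print("model    precision:        ", round(precision(y_true, y_pred), 3))
print("model    balanced accuracy:", round(balanced_accuracy(y_true, y_pred), 3))
print("model    AUC:              ", round(roc_auc(y_true, scores), 3))
print("baseline accuracy:         ", round(accuracy(y_true, y_baseline), 3))
print("baseline balanced accuracy:", round(balanced_accuracy(y_true, y_baseline), 3))
```

In this toy setting the trivial baseline reaches a plain accuracy close to the model's while its balanced accuracy falls to chance level, which is one reason balanced accuracy is often reported when non-respondents are the minority class. AUC is computed from the scores rather than the thresholded predictions, so it can rank models differently from the threshold-based metrics.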