A Data-Driven Supervised Machine Learning Approach to Estimating Global Ambient Air Pollution Concentrations With Associated Prediction Intervals
This work addresses the challenge of sparse, gap-prone air pollution monitoring data for stakeholders who need detailed outdoor air pollution assessments, though the contribution is incremental, as it builds on existing data-driven methods.
The paper tackles the problem of estimating global ambient air pollution concentrations by developing a scalable, supervised machine learning framework that imputes missing temporal and spatial measurements. The result is a comprehensive dataset at 0.25° spatial and hourly temporal resolution, with prediction intervals, for the pollutants NO2, O3, PM10, PM2.5, and SO2.
Global ambient air pollution is a transboundary challenge, and interventions against it typically rely on data from monitoring stations that are spatially sparse and heterogeneously placed. These stations also suffer temporal data gaps caused by issues such as power outages. In response, we have developed a scalable, data-driven, supervised machine learning framework that imputes missing temporal and spatial measurements, generating a comprehensive dataset for pollutants including NO$_2$, O$_3$, PM$_{10}$, PM$_{2.5}$, and SO$_2$. The dataset has a spatial resolution of 0.25$^{\circ}$ and an hourly temporal resolution, and each estimate is accompanied by a prediction interval, so it serves a wide range of stakeholders who rely on outdoor air pollution data for downstream assessments and enables more detailed studies. Additionally, we examine the model's performance across geographical locations, providing insights and recommendations for the strategic placement of future monitoring stations to further improve the model's accuracy.
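To make the idea of an estimate "accompanied by a prediction interval" concrete, the following is a minimal sketch, not the method used in this work. The abstract does not specify the model or how the intervals are constructed, so the sketch substitutes a one-feature least-squares fit with split conformal prediction. The data are synthetic, and every name in it (`fit_line`, `conformal_halfwidth`, `synth`, the single covariate `x`) is hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for any supervised model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def conformal_halfwidth(residuals, coverage=0.9):
    """Split-conformal half-width: a finite-sample quantile of |residual|
    computed on a calibration set that was held out from model fitting."""
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    rank = math.ceil((n + 1) * coverage)  # 1-indexed rank of the quantile
    return scores[min(rank, n) - 1]


# Synthetic data (assumption): one covariate x and a noisy "concentration" y.
rng = random.Random(0)


def synth(n):
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [5.0 + 2.0 * x + rng.gauss(0.0, 1.5) for x in xs]
    return xs, ys


x_train, y_train = synth(200)  # used to fit the model
x_cal, y_cal = synth(200)      # used only to size the intervals
x_test, y_test = synth(500)    # stands in for locations/hours without observations

a, b = fit_line(x_train, y_train)
q = conformal_halfwidth(
    [y - (a + b * x) for x, y in zip(x_cal, y_cal)], coverage=0.9
)

# Each estimate is reported together with a lower and an upper bound.
intervals = [(a + b * x - q, a + b * x + q) for x in x_test]
covered = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, y_test)) / len(y_test)

print(f"half-width: {q:.2f}, empirical coverage: {covered:.2%}")
```

The calibration set is kept separate from the training set so that the interval width reflects out-of-sample error. A framework like the one described here would presumably use a much richer feature set and may construct its intervals differently, for example with quantile regression.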