LG AI · Oct 30, 2025

SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth

arXiv:2510.26099v1 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more equitable and detailed evaluation in weather forecasting, particularly for stakeholders in diverse regions. It is an incremental advance: it builds on existing benchmarking methods by adding stratification.

The paper tackles a limitation of evaluating AI weather models with global average metrics alone: such metrics hide geographic and socioeconomic disparities. It introduces the SAFE package for stratified performance assessment across attributes such as country and income level, and finds that every state-of-the-art model tested shows disparities in forecasting skill across these strata.

The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. This amounts to averaging performance geospatially across the Earth in weather and climate settings, failing to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of the different attributes (e.g., the accuracy in every individual country). To demonstrate its importance, we utilize SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally-averaged metrics, we for the first time ask: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at https://github.com/N-Masi/safe
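
To make the idea of stratified assessment concrete, here is a minimal sketch of a per-stratum RMSE on a latitude-longitude grid. It does not use the SAFE package or reproduce its API. The function name `stratified_rmse`, the toy grid, and the cosine-of-latitude area weighting are all assumptions chosen for illustration, and the final gap between strata is only an illustrative spread, not the paper's fairness metric.

```python
import numpy as np


def stratified_rmse(pred, truth, lats, strata):
    """Area-weighted RMSE for each stratum label on a lat-lon grid.

    pred, truth : arrays of shape (n_lat, n_lon)
    lats        : array of shape (n_lat,), latitudes in degrees
    strata      : array of shape (n_lat, n_lon), one label per gridpoint
                  (hypothetical labels such as "land" / "water")
    """
    # Assumption: weight each gridpoint by cos(latitude) to approximate cell area.
    weights = np.cos(np.deg2rad(lats))[:, None] * np.ones_like(pred, dtype=float)
    sq_err = (pred - truth) ** 2

    scores = {}
    for label in np.unique(strata):
        mask = strata == label
        weighted_mse = np.sum(weights[mask] * sq_err[mask]) / np.sum(weights[mask])
        scores[label.item()] = float(np.sqrt(weighted_mse))
    return scores


# Toy 2x2 grid: the forecast error is larger over "water" than over "land".
lats = np.array([0.0, 60.0])
truth = np.zeros((2, 2))
pred = np.array([[1.0, 2.0],
                 [1.0, 2.0]])
strata = np.array([["land", "water"],
                   ["land", "water"]])

scores = stratified_rmse(pred, truth, lats, strata)
gap = max(scores.values()) - min(scores.values())  # illustrative spread only
print(scores, gap)
```

A single global average over this toy grid would report one number and hide the fact that the error over "water" is twice the error over "land"; grouping by stratum exposes it.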
