99.7AIJun 3
Agents' Last ExamYiyou Sun, Xinyang Han, Weichen Zhang et al.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.
COMP-PHDec 29, 2023
Learning About Structural Errors in Models of Complex Dynamical SystemsJin-Long Wu, Matthew E. Levine, Tapio Schneider et al.
Complex dynamical systems are notoriously difficult to model because some degrees of freedom (e.g., small scales) may be computationally unresolvable or are incompletely understood, yet they are dynamically important. For example, the small scales of cloud dynamics and droplet formation are crucial for controlling climate, yet are unresolvable in global climate models. Semi-empirical closure models for the effects of unresolved degrees of freedom often exist and encode important domain-specific knowledge. Building on such closure models and correcting them through learning the structural errors can be an effective way of fusing data with domain knowledge. Here we describe a general approach, principles, and algorithms for learning about structural errors. Key to our approach is to include structural error models inside the models of complex systems, for example, in closure models for unresolved scales. The structural errors then map, usually nonlinearly, to observable data. As a result, however, mismatches between model output and data are only indirectly informative about structural errors, due to a lack of labeled pairs of inputs and outputs of structural error models. Additionally, derivatives of the model may not exist or be readily available. We discuss how structural error models can be learned from indirect data with derivative-free Kalman inversion algorithms and variants, how sparsity constraints enforce a "do no harm" principle, and various ways of modeling structural errors. We also discuss the merits of using non-local and/or stochastic error models. In addition, we demonstrate how data assimilation techniques can assist the learning about structural errors in non-ergodic systems. The concepts and algorithms are illustrated in two numerical examples based on the Lorenz-96 system and a human glucose-insulin model.
COMP-PHApr 22, 2024
Toward Routing River Water in Land Surface Models with Recurrent Neural NetworksMauricio Lima, Katherine Deck, Oliver R. A. Dunbar et al.
Machine learning is playing an increasing role in hydrology, supplementing or replacing physics-based models. One notable example is the use of recurrent neural networks (RNNs) for forecasting streamflow given observed precipitation and geographic characteristics. Training of such a model over the continental United States (CONUS) has demonstrated that a single set of model parameters can be used across independent catchments, and that RNNs can outperform physics-based models. In this work, we take a next step and study the performance of RNNs for river routing in land surface models (LSMs). Instead of observed precipitation, the LSM-RNN uses instantaneous runoff calculated from physics-based models as an input. We train the model with data from river basins spanning the globe and test it using historical streamflow measurements. The model demonstrates skill at generalization across basins (predicting streamflow in catchments not used in training) and across time (predicting streamflow during years not used in training). We compare the predictions from the LSM-RNN to an existing physics-based model calibrated with a similar dataset and find that the LSM-RNN outperforms the physics-based model: a gain in median NSE from 0.56 to 0.64 (time-split experiment) and from 0.30 to 0.34 (basin-split experiment). Our results show that RNNs are effective for global streamflow prediction from runoff inputs and motivate the development of complete routing models that can capture nested sub-basis connections.
LGDec 11, 2024
Regional climate risk assessment from climate models using probabilistic machine learningZhong Yi Wan, Ignacio Lopez-Gomez, Robert Carver et al.
Accurate, actionable climate information at km scales is crucial for robust natural hazard risk assessment and infrastructure planning. Simulating climate at these resolutions remains intractable, forcing reliance on downscaling: either physics-based or statistical methods that transform climate simulations from coarse to impact-relevant resolutions. One major challenge for downscaling is to comprehensively capture the interdependency among climate processes of interest, a prerequisite for representing climate hazards. However, current approaches either lack the desired scalability or are bespoke to specific types of hazards. We introduce GenFocal, a computationally efficient, general-purpose, end-to-end generative framework that gives rise to full probabilistic characterizations of complex climate processes interacting at fine spatiotemporal scales. GenFocal more accurately assesses extreme risk in the current climate than leading approaches, including one used in the US 5th National Climate Assessment. It produces plausible tracks of tropical cyclones, providing accurate statistics of their genesis and evolution, even when they are absent from the corresponding climate simulations. GenFocal also shows compelling results that are consistent with the literature on projecting climate impact on decadal timescales. GenFocal revolutionizes how climate simulations can be efficiently augmented with observations and harnessed to enable future climate impact assessments at the spatiotemporal scales relevant to local and regional communities. We believe this work establishes genAI as an effective paradigm for modeling complex, high-dimensional multivariate statistical correlations that have deterred precise quantification of climate risks associated with hazards such as wildfires, extreme heat, tropical cyclones, and flooding; thereby enabling the evaluation of adaptation strategies.
LGDec 3, 2024
A Physics-Constrained Neural Differential Equation Framework for Data-Driven Snowpack SimulationAndrew Charbonneau, Katherine Deck, Tapio Schneider
This paper presents a physics-constrained neural differential equation framework for parameterization, and employs it to model the time evolution of seasonal snow depth given hydrometeorological forcings. When trained on data from multiple SNOTEL sites, the parameterization predicts daily snow depth with under 9% median error and Nash Sutcliffe Efficiencies over 0.94 across a wide variety of snow climates. The parameterization also generalizes to new sites not seen during training, which is not often true for calibrated snow models. Requiring the parameterization to predict snow water equivalent in addition to snow depth only increases error to ~12%. The structure of the approach guarantees the satisfaction of physical constraints, enables these constraints during model training, and allows modeling at different temporal resolutions without additional retraining of the parameterization. These benefits hold potential in climate modeling, and could extend to other dynamical systems with physical constraints.
MLAug 31, 2017
Earth System Modeling 2.0: A Blueprint for Models That Learn From Observations and Targeted High-Resolution SimulationsTapio Schneider, Shiwei Lan, Andrew Stuart et al.
Climate projections continue to be marred by large uncertainties, which originate in processes that need to be parameterized, such as clouds, convection, and ecosystems. But rapid progress is now within reach. New computational tools and methods from data assimilation and machine learning make it possible to integrate global observations and local high-resolution simulations in an Earth system model (ESM) that systematically learns from both. Here we propose a blueprint for such an ESM. We outline how parameterization schemes can learn from global observations and targeted high-resolution simulations, for example, of clouds and convection, through matching low-order statistics between ESMs, observations, and high-resolution simulations. We illustrate learning algorithms for ESMs with a simple dynamical system that shares characteristics of the climate system; and we discuss the opportunities the proposed framework presents and the challenges that remain to realize it.