LG AI ML | Jul 11, 2025

Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

arXiv:2507.08977v2 | 5 citations | h-index: 2
Originality: Highly original
AI Analysis

This paper provides a unified framework for using simulations as training data to improve predictive performance and interpretability in scientific domains such as epidemiology, chemistry, and ecology.

The paper tackles the tradeoff between mechanistic models (scientifically grounded but limited) and machine learning models (predictive but data-hungry and uninterpretable) by introducing Simulation-Grounded Neural Networks (SGNNs), which use mechanistic simulations as training data. Results show SGNNs nearly tripled COVID-19 forecasting skill versus CDC baselines, reduced chemical yield prediction error by one-third, and enabled accurate inference for unobservable targets like COVID-19 transmissibility.

Scientific modeling faces a tradeoff: mechanistic models provide scientific grounding but struggle with real-world complexity, while machine learning models achieve strong predictive performance but require large labeled datasets and are not interpretable. We introduce Simulation-Grounded Neural Networks (SGNNs), which use mechanistic simulations as training data for neural networks. SGNNs are pretrained on synthetic corpora spanning diverse model structures, parameter regimes, stochasticity, and observational artifacts. Simulation-grounded learning has been applied in multiple domains (e.g., surrogate models in physics, forecasting in epidemiology). We provide a unified framework for simulation-grounded learning and evaluate SGNNs across scientific disciplines and modeling tasks. We found that SGNNs were successful across domains: for prediction tasks, they nearly tripled COVID-19 forecasting skill versus CDC baselines, reduced chemical yield prediction error by one-third, and maintained accuracy in ecological forecasting where task-specific models failed. For inference tasks, SGNNs accurately classified the source of information spread in simulated social networks and enabled supervised learning for unobservable targets, such as estimating COVID-19 transmissibility more accurately than traditional methods even in early outbreaks. Finally, SGNNs enable back-to-simulation attribution, a form of mechanistic interpretability. Back-to-simulation attribution matches real-world observations to the training simulations the model considers most similar, identifying which mechanistic processes the model believes best explain the observed data. By providing a unified framework for simulation-grounded learning, we establish when and how mechanistic simulations can serve as effective training data for robust, interpretable scientific inference.
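The two ideas in the abstract that are easiest to picture in code are (1) supervision from simulations for a target that cannot be observed directly, and (2) back-to-simulation attribution. The sketch below is a toy illustration only, not the paper's method: it swaps the neural network for a plain nearest-neighbour lookup over raw simulated curves, uses a hypothetical discrete-time SIR simulator, and every function name and parameter value (`simulate_sir`, `build_corpus`, `attribute`, `gamma=0.2`, the `beta` range) is an assumption made for the example.

```python
import math
import random


def simulate_sir(beta, gamma=0.2, n=10_000, i0=10, steps=40, noise=0.05, rng=None):
    """Toy discrete-time SIR model (an assumption, not the paper's simulator).

    Returns daily new infections with multiplicative Gaussian noise,
    standing in for observational artifacts.
    """
    rng = rng or random.Random(0)
    s, i = n - i0, i0
    curve = []
    for _ in range(steps):
        new_inf = min(s, beta * s * i / n)
        new_rec = gamma * i
        s -= new_inf
        i += new_inf - new_rec
        curve.append(new_inf * (1.0 + rng.gauss(0.0, noise)))
    return curve


def build_corpus(n_sims=300, seed=0):
    """Synthetic corpus: each entry keeps the curve AND the hidden parameter.

    The transmission rate beta is unobservable in real data, but it is a
    free label in simulation, which is what makes supervision possible.
    """
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_sims):
        beta = rng.uniform(0.25, 0.8)
        corpus.append({"beta": beta, "curve": simulate_sir(beta, rng=rng)})
    return corpus


def distance(a, b):
    """Euclidean distance on log-scaled curves.

    A real SGNN would compare learned representations; raw curves are
    used here purely to keep the sketch dependency-free.
    """
    return math.sqrt(sum(
        (math.log1p(max(x, 0.0)) - math.log1p(max(y, 0.0))) ** 2
        for x, y in zip(a, b)
    ))


def attribute(observed, corpus, k=5):
    """Back-to-simulation attribution, nearest-neighbour version.

    Returns the k training simulations most similar to the observation,
    as (distance, corpus_entry) pairs sorted by increasing distance.
    """
    scored = [(distance(observed, sim["curve"]), sim) for sim in corpus]
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


def estimate_beta(observed, corpus, k=5):
    """Estimate the hidden parameter as the mean over the matched simulations."""
    matches = attribute(observed, corpus, k)
    return sum(sim["beta"] for _, sim in matches) / len(matches)


if __name__ == "__main__":
    corpus = build_corpus()
    # Stand-in for a real-world observation with unknown beta.
    observed = simulate_sir(0.5, rng=random.Random(123))
    print("estimated beta:", round(estimate_beta(observed, corpus), 3))
    for dist, sim in attribute(observed, corpus, k=3):
        print("matched simulation: beta=%.3f, distance=%.3f" % (sim["beta"], dist))
```

The matched simulations play the role of the explanation: their known parameters indicate which mechanistic regime the lookup considers closest to the observed data. In the paper's setting the comparison would be made by the trained network rather than by a hand-written distance.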

