Instrumented data for causal scientific machine learning

arXiv:2606.07865
h-index: 3
Originality: Highly original
AI Analysis

For researchers in scientific ML, this paper introduces a new data paradigm that addresses the fundamental limitation of observational and synthetic data by providing mechanistically supervised, causally interpretable data.

The paper proposes instrumented data, where each datum carries its mechanistic model, uncertainty, and counterfactuals, as a new data paradigm for scientific machine learning. It argues this approach is operationally feasible via V&V image-to-simulation pipelines and can improve validation, auditing, and surrogate training across multiple scientific domains.

Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process but only for the simulator's template, not the case a user faces. We argue a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and a propagated aleatoric/epistemic uncertainty. The substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.
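As a rough illustration of what "every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals" could look like in code, here is a minimal Python sketch. It is not the paper's implementation: the names `InstrumentedDatum`, `do`, `propagate`, and `decay` are hypothetical, the mechanistic model is a toy first-order decay rather than a solver-backed simulation, and aleatoric and epistemic uncertainty are collapsed into a single assumed Gaussian per parameter.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple
import math
import random

# All names here are hypothetical, chosen for illustration only.
Params = Dict[str, float]


@dataclass(frozen=True)
class InstrumentedDatum:
    observation: float                 # what the sensor recorded
    model: Callable[[Params], float]   # mechanistic model attached to the datum
    params: Params                     # explicit, editable parameter estimates
    param_std: Params                  # assumed Gaussian std per parameter

    def predict(self) -> float:
        """Run the attached model at the current parameter estimates."""
        return self.model(self.params)

    def do(self, **interventions: float) -> "InstrumentedDatum":
        """Return a counterfactual datum with some parameters fixed by intervention.

        Intervened parameters lose their uncertainty; the factual observation
        is carried over unchanged as a reference.
        """
        new_params = {**self.params, **interventions}
        new_std = {
            k: (0.0 if k in interventions else s)
            for k, s in self.param_std.items()
        }
        return replace(self, params=new_params, param_std=new_std)

    def propagate(self, n: int = 2000, seed: int = 0) -> Tuple[float, float]:
        """Monte Carlo propagation of parameter uncertainty to the model output."""
        rng = random.Random(seed)
        samples = [
            self.model({
                k: rng.gauss(v, self.param_std.get(k, 0.0))
                for k, v in self.params.items()
            })
            for _ in range(n)
        ]
        mean = sum(samples) / n
        var = sum((s - mean) ** 2 for s in samples) / (n - 1)
        return mean, math.sqrt(var)


def decay(p: Params) -> float:
    # Toy stand-in for a simulation: first-order decay evaluated at time t.
    return p["y0"] * math.exp(-p["k"] * p["t"])


datum = InstrumentedDatum(
    observation=0.61,
    model=decay,
    params={"y0": 1.0, "k": 0.5, "t": 1.0},
    param_std={"y0": 0.0, "k": 0.05, "t": 0.0},
)
counterfactual = datum.do(k=1.0)  # "what if the decay rate had been 1.0?"
```

Calling `datum.propagate()` gives a mean and spread for the factual case, while `counterfactual.predict()` re-runs the same model under the intervention. In a real pipeline the callable would presumably wrap a validated solver, and the intervention would act on a richer structural model than a flat parameter dictionary.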
