MLApr 21, 2022
Ultra-marginal Feature Importance: Learning from Data with Causal GuaranteesJoseph Janssen, Vincent Guan, Elina Robeva
Scientists frequently prioritize learning from data rather than training the best possible model; however, research in machine learning often prioritizes the latter. Marginal contribution feature importance (MCI) was developed to break this trend by providing a useful framework for quantifying the relationships in data. In this work, we aim to improve upon the theoretical properties, performance, and runtime of MCI by introducing ultra-marginal feature importance (UMFI), which uses dependence removal techniques from the AI fairness literature as its foundation. We first propose axioms for feature importance methods that seek to explain the causal and associative relationships in data, and we prove that UMFI satisfies these axioms under basic assumptions. We then show on real and simulated data that UMFI performs better than MCI, especially in the presence of correlated interactions and unrelated features, while partially learning the structure of the causal graph and reducing the exponential runtime of MCI to super-linear.
MLOct 30, 2024
Identifying Drift, Diffusion, and Causal Structure from Temporal SnapshotsVincent Guan, Joseph Janssen, Hossein Rahmani et al.
Stochastic differential equations (SDEs) are a fundamental tool for modelling dynamic processes, including gene regulatory networks (GRNs), contaminant transport, financial markets, and image generation. However, learning the underlying SDE from data is a challenging task, especially if individual trajectories are not observable. Motivated by burgeoning research in single-cell datasets, we present the first comprehensive approach for jointly identifying the drift and diffusion of an SDE from its temporal marginals. Assuming linear drift and additive diffusion, we prove that these parameters are identifiable from marginals if and only if the initial distribution lacks any generalized rotational symmetries. We further prove that the causal graph of any SDE with additive diffusion can be recovered from the SDE parameters. To complement this theory, we adapt entropy-regularized optimal transport to handle anisotropic diffusion, and introduce APPEX (Alternating Projection Parameter Estimation from $X_0$), an iterative algorithm designed to estimate the drift, diffusion, and causal graph of an additive noise SDE, solely from temporal marginals. We show that APPEX iteratively decreases Kullback-Leibler divergence to the true solution, and demonstrate its effectiveness on simulated data from linear additive noise SDEs.
LGApr 30, 2024
Tackling water table depth modeling via machine learning: From proxy observations to verifiabilityJoseph Janssen, Ardalan Tootchi, Ali A. Ameli
Spatial patterns of water table depth (WTD) play a crucial role in shaping ecological resilience, hydrological connectivity, and human-centric systems. Generally, a large-scale (e.g., continental or global) continuous map of static WTD can be simulated using either physically-based (PB) or machine learning-based (ML) models. We construct three fine-resolution (500 m) ML simulations of WTD, using the XGBoost algorithm and more than 20 million real and proxy observations of WTD, across the United States and Canada. The three ML models were constrained using known physical relations between WTD's drivers and WTD and were trained by sequentially adding real and proxy observations of WTD. Through an extensive (pixel-by-pixel) evaluation across the study region and within ten major ecoregions of North America, we demonstrate that our models (corr=0.6-0.75) can more accurately predict unseen real and proxy observations of WTD compared to two available PB simulations of WTD (corr=0.21-0.40). However, we still argue that currently-available large-scale simulations of static WTD could be uncertain within data-scarce regions such as steep mountainous regions. We reason that biased observational data mainly collected from low-elevation floodplains and the over-flexibility of available models can negatively affect the verifiability of large-scale simulations of WTD. Ultimately, we thoroughly discuss future directions that may help hydrogeologists decide how to improve machine learning-based WTD estimations. In particular, we advocate for the use of proxy satellite data, the incorporation of physical laws, the implementation of better model verification standards, the development of novel globally-available emergent indices, and the collection of more reliable observations.