MEFeb 20, 2023
Gaussian processes at the Helm(holtz): A more fluid model for ocean currentsRenato Berlinghieri, Brian L. Trippe, David R. Burt et al.
Given sparse observations of buoy velocities, oceanographers are interested in reconstructing ocean currents away from the buoys and identifying divergences in a current vector field. As a first and modular step, we focus on the time-stationary case - for instance, by restricting to short time periods. Since we expect current velocity to be a continuous but highly non-linear function of spatial location, Gaussian processes (GPs) offer an attractive model. But we show that applying a GP with a standard stationary kernel directly to buoy data can struggle at both current reconstruction and divergence identification, due to some physically unrealistic prior assumptions. To better reflect known physical properties of currents, we propose to instead put a standard stationary kernel on the divergence and curl-free components of a vector field obtained through a Helmholtz decomposition. We show that, because this decomposition relates to the original vector field just via mixed partial derivatives, we can still perform inference given the original data with only a small constant multiple of additional computational expense. We illustrate the benefits of our method with theory and experiments on synthetic and real ocean data.
LGMay 3, 2022Code
Subspace Diffusion Generative ModelsBowen Jing, Gabriele Corso, Renato Berlinghieri et al.
Score-based models generate samples by mapping noise to data (and vice versa) via a high-dimensional diffusion process. We question whether it is necessary to run this entire process at high dimensionality and incur all the inconveniences thereof. Instead, we restrict the diffusion via projections onto subspaces as the data distribution evolves toward noise. When applied to state-of-the-art models, our framework simultaneously improves sample quality -- reaching an FID of 2.17 on unconditional CIFAR-10 -- and reduces the computational cost of inference for the same number of denoising steps. Our framework is fully compatible with continuous-time diffusion and retains its flexible capabilities, including exact log-likelihoods and controllable generation. Code is available at https://github.com/bjing2016/subspace-diffusion.
MLAug 12, 2024
Multi-marginal Schrödinger Bridges with Iterative Reference RefinementYunyi Shen, Renato Berlinghieri, Tamara Broderick
Practitioners often aim to infer an unobserved population trajectory using sample snapshots at multiple time points. E.g., given single-cell sequencing data, scientists would like to learn how gene expression changes over a cell's life cycle. But sequencing any cell destroys that cell. So we can access data for any particular cell only at a single time point, but we have data across many cells. The deep learning community has recently explored using Schrödinger bridges (SBs) and their extensions in similar settings. However, existing methods either (1) interpolate between just two time points or (2) require a single fixed reference dynamic (often set to Brownian motion within SBs). But learning piecewise from adjacent time points can fail to capture long-term dependencies. And practitioners are typically able to specify a model family for the reference dynamic but not the exact values of the parameters within it. So we propose a new method that (1) learns the unobserved trajectories from sample snapshots across multiple time points and (2) requires specification only of a family of reference dynamics, not a single fixed one. We demonstrate the advantages of our method on simulated and real data.
LGSep 9, 2024
Are Hourly PM2.5 Forecasts Sufficiently Accurate to Plan Your Day? Individual Decision Making in the Face of Increasing Wildfire SmokeRenato Berlinghieri, David R. Burt, Paolo Giani et al.
Wildfire frequency is increasing as the climate changes, and the resulting air pollution poses health risks. Just as people routinely use hourly weather forecasts to plan their day's activities around precipitation, reliable hourly air quality forecasts could help individuals reduce their exposure to air pollution. In the present work, we evaluate six existing forecasts of ground-level fine particulate matter (PM2.5) within the continental United States during the 2023 fire season. We include forecasts using physical simulation, ensembling, and artificial intelligence. We focus our evaluation on individual decisions, such as (1) whether to go outside on a day with potentially high PM2.5 or (2) when to go outside for the lowest PM2.5 exposure. Our evaluation consists of both visualizations of hourly PM2.5 forecasts in particular locations as well as metrics summarizing forecast skill for the two tasks above. As part of our analysis, we introduce a new evaluation metric for the task of deciding when to go outside. We find meaningful room for improvement in PM2.5 forecasting, which might be realized by improving physical models, incorporating more data sources, and using artificial intelligence tools.
MLMay 21, 2025
Oh SnapMMD! Forecasting Stochastic Dynamics Beyond the Schrödinger Bridge's EndRenato Berlinghieri, Yunyi Shen, Jialong Jiang et al.
Scientists often want to make predictions beyond the observed time horizon of "snapshot" data following latent stochastic dynamics. For example, in time course single-cell mRNA profiling, scientists have access to cellular transcriptional state measurements (snapshots) from different biological replicates at different time points, but they cannot access the trajectory of any one cell because measurement destroys the cell. Researchers want to forecast (e.g.) differentiation outcomes from early state measurements of stem cells. Recent Schrödinger-bridge (SB) methods are natural for interpolating between snapshots. But past SB papers have not addressed forecasting -- likely since existing methods either (1) reduce to following pre-set reference dynamics (chosen before seeing data) or (2) require the user to choose a fixed, state-independent volatility since they minimize a Kullback-Leibler divergence. Either case can lead to poor forecasting quality. In the present work, we propose a new framework, SnapMMD, that learns dynamics by directly fitting the joint distribution of both state measurements and observation time with a maximum mean discrepancy (MMD) loss. Unlike past work, our method allows us to infer unknown and state-dependent volatilities from the observed data. We show in a variety of real and synthetic experiments that our method delivers accurate forecasts. Moreover, our approach allows us to learn in the presence of incomplete state measurements and yields an $R^2$-style statistic that diagnoses fit. We also find that our method's performance at interpolation (and general velocity-field reconstruction) is at least as good as (and often better than) state-of-the-art in almost all of our experiments.
MLFeb 9, 2025
Smooth Sailing: Lipschitz-Driven Uncertainty Quantification for Spatial AssociationDavid R. Burt, Renato Berlinghieri, Stephen Bates et al.
Estimating associations between spatial covariates and responses - rather than merely predicting responses - is central to environmental science, epidemiology, and economics. For instance, public health officials might be interested in whether air pollution has a strictly positive association with a health outcome, and the magnitude of any effect. Standard machine learning methods often provide accurate predictions but offer limited insight into covariate-response relationships. And we show that existing methods for constructing confidence (or credible) intervals for associations can fail to provide nominal coverage in the face of model misspecification and nonrandom locations - despite both being essentially always present in spatial problems. We introduce a method that constructs valid frequentist confidence intervals for associations in spatial settings. Our method requires minimal assumptions beyond a form of spatial smoothness and a homoskedastic Gaussian error assumption. In particular, we do not require model correctness or covariate overlap between training and target locations. Our approach is the first to guarantee nominal coverage in this setting and outperforms existing techniques in both real and simulated experiments. Our confidence intervals are valid in finite samples when the noise of the Gaussian error is known, and we provide an asymptotically consistent estimation procedure for this noise variance when it is unknown.
MESep 1, 2025
Wrong Model, Right Uncertainty: Spatial Associations for Discrete Data with MisspecificationDavid R. Burt, Renato Berlinghieri, Tamara Broderick
Scientists are often interested in estimating an association between a covariate and a binary- or count-valued response. For instance, public health officials are interested in how much disease presence (a binary response per individual) varies as temperature or pollution (covariates) increases. Many existing methods can be used to estimate associations, and corresponding uncertainty intervals, but make unrealistic assumptions in the spatial domain. For instance, they incorrectly assume models are well-specified. Or they assume the training and target locations are i.i.d. -- whereas in practice, these locations are often not even randomly sampled. Some recent work avoids these assumptions but works only for continuous responses with spatially constant noise. In the present work, we provide the first confidence intervals with guaranteed asymptotic nominal coverage for spatial associations given discrete responses, even under simultaneous model misspecification and nonrandom sampling of spatial locations. To do so, we demonstrate how to handle spatially varying noise, provide a novel proof of consistency for our proposed estimator, and use a delta method argument with a Lyapunov central limit theorem. We show empirically that standard approaches can produce unreliable confidence intervals and can even get the sign of an association wrong, while our method reliably provides correct coverage.