MLOct 14, 2022
Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover TreesAlexander Terenin, David R. Burt, Artem Artemev et al.
Gaussian processes are frequently deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. To do so, we first review numerical stability, and illustrate typical situations in which Gaussian process models can be unstable. Building on stability theory originally developed in the interpolation literature, we derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We provide illustrative examples showing the relationship between stability of calculations and predictive performance of inducing point methods on spatial tasks.
MLNov 4, 2022
Sparse Gaussian Process Hyperparameters: Optimize or Integrate?Vidhi Lalchand, Wessel P. Bruinsma, David R. Burt et al.
The kernel function and its hyperparameters are the central model selection choice in a Gaussian proces (Rasmussen and Williams, 2006). Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GPs paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of Titsias (2009). This work is closely related to Hensman et al. (2015b) but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in literature along with stochastic variational GPs (SVGPs) along with an extensive computational analysis.
MEFeb 20, 2023
Gaussian processes at the Helm(holtz): A more fluid model for ocean currentsRenato Berlinghieri, Brian L. Trippe, David R. Burt et al.
Given sparse observations of buoy velocities, oceanographers are interested in reconstructing ocean currents away from the buoys and identifying divergences in a current vector field. As a first and modular step, we focus on the time-stationary case - for instance, by restricting to short time periods. Since we expect current velocity to be a continuous but highly non-linear function of spatial location, Gaussian processes (GPs) offer an attractive model. But we show that applying a GP with a standard stationary kernel directly to buoy data can struggle at both current reconstruction and divergence identification, due to some physically unrealistic prior assumptions. To better reflect known physical properties of currents, we propose to instead put a standard stationary kernel on the divergence and curl-free components of a vector field obtained through a Helmholtz decomposition. We show that, because this decomposition relates to the original vector field just via mixed partial derivatives, we can still perform inference given the original data with only a small constant multiple of additional computational expense. We illustrate the benefits of our method with theory and experiments on synthetic and real ocean data.
MLMay 15, 2022
A Note on the Chernoff Bound for Random Variables in the Unit IntervalAndrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt
The Chernoff bound is a well-known tool for obtaining a high probability bound on the expectation of a Bernoulli random variable in terms of its sample average. This bound is commonly used in statistical learning theory to upper bound the generalisation risk of a hypothesis in terms of its empirical risk on held-out data, for the case of a binary-valued loss function. However, the extension of this bound to the case of random variables taking values in the unit interval is less well known in the community. In this note we provide a proof of this extension for convenience and future reference.
LGSep 9, 2024
Are Hourly PM2.5 Forecasts Sufficiently Accurate to Plan Your Day? Individual Decision Making in the Face of Increasing Wildfire SmokeRenato Berlinghieri, David R. Burt, Paolo Giani et al.
Wildfire frequency is increasing as the climate changes, and the resulting air pollution poses health risks. Just as people routinely use hourly weather forecasts to plan their day's activities around precipitation, reliable hourly air quality forecasts could help individuals reduce their exposure to air pollution. In the present work, we evaluate six existing forecasts of ground-level fine particulate matter (PM2.5) within the continental United States during the 2023 fire season. We include forecasts using physical simulation, ensembling, and artificial intelligence. We focus our evaluation on individual decisions, such as (1) whether to go outside on a day with potentially high PM2.5 or (2) when to go outside for the lowest PM2.5 exposure. Our evaluation consists of both visualizations of hourly PM2.5 forecasts in particular locations as well as metrics summarizing forecast skill for the two tasks above. As part of our analysis, we introduce a new evaluation metric for the task of deciding when to go outside. We find meaningful room for improvement in PM2.5 forecasting, which might be realized by improving physical models, incorporating more data sources, and using artificial intelligence tools.
MLFeb 5, 2024
Consistent Validation for Predictive Methods in Spatial SettingsDavid R. Burt, Yunyi Shen, Tamara Broderick
Spatial prediction tasks are key to weather forecasting, studying air pollution impacts, and other scientific endeavors. Determining how much to trust predictions made by statistical or physical methods is essential for the credibility of scientific conclusions. Unfortunately, classical approaches for validation fail to handle mismatch between locations available for validation and (test) locations where we want to make predictions. This mismatch is often not an instance of covariate shift (as commonly formalized) because the validation and test locations are fixed (e.g., on a grid or at select points) rather than i.i.d. from two distributions. In the present work, we formalize a check on validation methods: that they become arbitrarily accurate as validation data becomes arbitrarily dense. We show that classical and covariate-shift methods can fail this check. We propose a method that builds from existing ideas in the covariate-shift literature, but adapts them to the validation data at hand. We prove that our proposal passes our check. And we demonstrate its advantages empirically on simulated and real data.
MLFeb 9, 2025
Smooth Sailing: Lipschitz-Driven Uncertainty Quantification for Spatial AssociationDavid R. Burt, Renato Berlinghieri, Stephen Bates et al.
Estimating associations between spatial covariates and responses - rather than merely predicting responses - is central to environmental science, epidemiology, and economics. For instance, public health officials might be interested in whether air pollution has a strictly positive association with a health outcome, and the magnitude of any effect. Standard machine learning methods often provide accurate predictions but offer limited insight into covariate-response relationships. And we show that existing methods for constructing confidence (or credible) intervals for associations can fail to provide nominal coverage in the face of model misspecification and nonrandom locations - despite both being essentially always present in spatial problems. We introduce a method that constructs valid frequentist confidence intervals for associations in spatial settings. Our method requires minimal assumptions beyond a form of spatial smoothness and a homoskedastic Gaussian error assumption. In particular, we do not require model correctness or covariate overlap between training and target locations. Our approach is the first to guarantee nominal coverage in this setting and outperforms existing techniques in both real and simulated experiments. Our confidence intervals are valid in finite samples when the noise of the Gaussian error is known, and we provide an asymptotically consistent estimation procedure for this noise variance when it is unknown.
MESep 1, 2025
Wrong Model, Right Uncertainty: Spatial Associations for Discrete Data with MisspecificationDavid R. Burt, Renato Berlinghieri, Tamara Broderick
Scientists are often interested in estimating an association between a covariate and a binary- or count-valued response. For instance, public health officials are interested in how much disease presence (a binary response per individual) varies as temperature or pollution (covariates) increases. Many existing methods can be used to estimate associations, and corresponding uncertainty intervals, but make unrealistic assumptions in the spatial domain. For instance, they incorrectly assume models are well-specified. Or they assume the training and target locations are i.i.d. -- whereas in practice, these locations are often not even randomly sampled. Some recent work avoids these assumptions but works only for continuous responses with spatially constant noise. In the present work, we provide the first confidence intervals with guaranteed asymptotic nominal coverage for spatial associations given discrete responses, even under simultaneous model misspecification and nonrandom sampling of spatial locations. To do so, we demonstrate how to handle spatially varying noise, provide a novel proof of consistency for our proposed estimator, and use a delta method argument with a Lyapunov central limit theorem. We show empirically that standard approaches can produce unreliable confidence intervals and can even get the sign of an association wrong, while our method reliably provides correct coverage.
LGFeb 23, 2022
Wide Mean-Field Bayesian Neural Networks Ignore the DataBeau Coker, Wessel P. Bruinsma, David R. Burt et al.
Bayesian neural networks (BNNs) combine the expressive power of deep learning with the advantages of Bayesian formalism. In recent years, the analysis of wide, deep BNNs has provided theoretical insight into their priors and posteriors. However, we have no analogous insight into their posteriors under approximate inference. In this work, we show that mean-field variational inference entirely fails to model the data when the network width is large and the activation function is odd. Specifically, for fully-connected BNNs with odd activation functions and a homoscedastic Gaussian likelihood, we show that the optimal mean-field variational posterior predictive (i.e., function space) distribution converges to the prior predictive distribution as the width tends to infinity. We generalize aspects of this result to other likelihoods. Our theoretical results are suggestive of underfitting behavior previously observered in BNNs. While our convergence bounds are non-asymptotic and constants in our analysis can be computed, they are currently too loose to be applicable in standard training regimes. Finally, we show that the optimal approximate posterior need not tend to the prior if the activation function is not odd, showing that our statements cannot be generalized arbitrarily.
MLSep 20, 2021
Barely Biased Learning for Gaussian Process RegressionDavid R. Burt, Artem Artemev, Mark van der Wilk
Recent work in scalable approximate Gaussian process regression has discussed a bias-variance-computation trade-off when estimating the log marginal likelihood. We suggest a method that adaptively selects the amount of computation to use when estimating the log marginal likelihood so that the bias of the objective function is guaranteed to be small. While simple in principle, our current implementation of the method is not competitive computationally with existing approximations.
MLJun 7, 2021
How Tight Can PAC-Bayes be in the Small Data Regime?Andrew Y. K. Foong, Wessel P. Bruinsma, David R. Burt et al.
In this paper, we investigate the question: Given a small number of datapoints, for example N = 30, how tight can PAC-Bayes and test set bounds be made? For such small datasets, test set bounds adversely affect generalisation performance by withholding data from the training procedure. In this setting, PAC-Bayes bounds are especially attractive, due to their ability to use all the data to simultaneously learn a posterior and bound its generalisation risk. We focus on the case of i.i.d. data with a bounded loss and consider the generic PAC-Bayes theorem of Germain et al. While their theorem is known to recover many existing PAC-Bayes bounds, it is unclear what the tightest bound derivable from their framework is. For a fixed learning algorithm and dataset, we show that the tightest possible bound coincides with a bound considered by Catoni; and, in the more natural case of distributions over datasets, we establish a lower bound on the best bound achievable in expectation. Interestingly, this lower bound recovers the Chernoff test set bound if the posterior is equal to the prior. Moreover, to illustrate how tight these bounds can be, we study synthetic one-dimensional classification tasks in which it is feasible to meta-learn both the prior and the form of the bound to numerically optimise for the tightest bounds possible. We find that in this simple, controlled scenario, PAC-Bayes bounds are competitive with comparable, commonly used Chernoff test set bounds. However, the sharpest test set bounds still lead to better guarantees on the generalisation error than the PAC-Bayes bounds we consider.
MLFeb 16, 2021
Tighter Bounds on the Log Marginal Likelihood of Gaussian Process Regression Using Conjugate GradientsArtem Artemev, David R. Burt, Mark van der Wilk
We propose a lower bound on the log marginal likelihood of Gaussian process regression models that can be computed without matrix factorisation of the full kernel matrix. We show that approximate maximum likelihood learning of model parameters by maximising our lower bound retains many of the sparse variational approach benefits while reducing the bias introduced into parameter learning. The basis of our bound is a more careful analysis of the log-determinant term appearing in the log marginal likelihood, as well as using the method of conjugate gradients to derive tight lower bounds on the term involving a quadratic form. Our approach is a step forward in unifying methods relying on lower bound maximisation (e.g. variational methods) and iterative approaches based on conjugate gradients for training Gaussian processes. In experiments, we show improved predictive performance with our model for a comparable amount of training time compared to other conjugate gradient based approaches.
MLNov 18, 2020
Understanding Variational Inference in Function-SpaceDavid R. Burt, Sebastian W. Ober, Adrià Garriga-Alonso et al.
Recent work has attempted to directly approximate the `function-space' or predictive posterior distribution of Bayesian models, without approximating the posterior distribution over the parameters. This is appealing in e.g. Bayesian neural networks, where we only need the former, and the latter is hard to represent. In this work, we highlight some advantages and limitations of employing the Kullback-Leibler divergence in this setting. For example, we show that minimizing the KL divergence between a wide class of parametric distributions and the posterior induced by a (non-degenerate) Gaussian process prior leads to an ill-defined objective function. Then, we propose (featurized) Bayesian linear regression as a benchmark for `function-space' inference methods that directly measures approximation quality. We apply this methodology to assess aspects of the objective function and inference scheme considered in Sun, Zhang, Shi, and Grosse (2018), emphasizing the quality of approximation to Bayesian inference as opposed to predictive performance.
MLAug 1, 2020
Convergence of Sparse Variational Inference in Gaussian Processes RegressionDavid R. Burt, Carl Edward Rasmussen, Mark van der Wilk
Gaussian processes are distributions over functions that are versatile and mathematically convenient priors in Bayesian modelling. However, their use is often impeded for data with large numbers of observations, $N$, due to the cubic (in $N$) cost of matrix operations used in exact inference. Many solutions have been proposed that rely on $M \ll N$ inducing variables to form an approximation at a cost of $\mathcal{O}(NM^2)$. While the computational cost appears linear in $N$, the true complexity depends on how $M$ must scale with $N$ to ensure a certain quality of the approximation. In this work, we investigate upper and lower bounds on how $M$ needs to grow with $N$ to ensure high quality approximations. We show that we can make the KL-divergence between the approximate model and the exact posterior arbitrarily small for a Gaussian-noise regression model with $M\ll N$. Specifically, for the popular squared exponential kernel and $D$-dimensional Gaussian distributed covariates, $M=\mathcal{O}((\log N)^D)$ suffice and a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
MLJun 23, 2020
Variational Orthogonal FeaturesDavid R. Burt, Carl Edward Rasmussen, Mark van der Wilk
Sparse stochastic variational inference allows Gaussian process models to be applied to large datasets. The per iteration computational cost of inference with this method is $\mathcal{O}(\tilde{N}M^2+M^3),$ where $\tilde{N}$ is the number of points in a minibatch and $M$ is the number of `inducing features', which determine the expressiveness of the variational family. Several recent works have shown that for certain priors, features can be defined that remove the $\mathcal{O}(M^3)$ cost of computing a minibatch estimate of an evidence lower bound (ELBO). This represents a significant computational savings when $M\gg \tilde{N}$. We present a construction of features for any stationary prior kernel that allow for computation of an unbiased estimator to the ELBO using $T$ Monte Carlo samples in $\mathcal{O}(\tilde{N}T+M^2T)$ and in $\mathcal{O}(\tilde{N}T+MT)$ with an additional approximation. We analyze the impact of this additional approximation on inference quality.
LGJan 28, 2020
Bandit optimisation of functions in the Matérn kernel RKHSDavid Janz, David R. Burt, Javier González
We consider the problem of optimising functions in the reproducing kernel Hilbert space (RKHS) of a Matérn kernel with smoothness parameter $ν$ over the domain $[0,1]^d$ under noisy bandit feedback. Our contribution, the $π$-GP-UCB algorithm, is the first practical approach with guaranteed sublinear regret for all $ν>1$ and $d \geq 1$. Empirical validation suggests better performance and drastically improved computational scalablity compared with its predecessor, Improved GP-UCB.
MLSep 2, 2019
On the Expressiveness of Approximate Inference in Bayesian Neural NetworksAndrew Y. K. Foong, David R. Burt, Yingzhen Li et al.
While Bayesian neural networks (BNNs) hold the promise of being flexible, well-calibrated statistical models, inference often requires approximations whose consequences are poorly understood. We study the quality of common variational methods in approximating the Bayesian predictive distribution. For single-hidden layer ReLU BNNs, we prove a fundamental limitation in function-space of two of the most commonly used distributions defined in weight-space: mean-field Gaussian and Monte Carlo dropout. We find there are simple cases where neither method can have substantially increased uncertainty in between well-separated regions of low uncertainty. We provide strong empirical evidence that exact inference does not have this pathology, hence it is due to the approximation and not the model. In contrast, for deep networks, we prove a universality result showing that there exist approximate posteriors in the above classes which provide flexible uncertainty estimates. However, we find empirically that pathologies of a similar form as in the single-hidden layer case can persist when performing variational inference in deeper networks. Our results motivate careful consideration of the implications of approximate inference methods in BNNs.
MLMar 8, 2019
Rates of Convergence for Sparse Variational Gaussian Process RegressionDavid R. Burt, Carl E. Rasmussen, Mark van der Wilk
Excellent variational approximations to Gaussian process posteriors have been developed which avoid the $\mathcal{O}\left(N^3\right)$ scaling with dataset size $N$. They reduce the computational cost to $\mathcal{O}\left(NM^2\right)$, with $M\ll N$ being the number of inducing variables, which summarise the process. While the computational cost seems to be linear in $N$, the true complexity of the algorithm depends on how $M$ must increase to ensure a certain quality of approximation. We address this by characterising the behavior of an upper bound on the KL divergence to the posterior. We show that with high probability the KL divergence can be made arbitrarily small by growing $M$ more slowly than $N$. A particular case of interest is that for regression with normally distributed inputs in D-dimensions with the popular Squared Exponential kernel, $M=\mathcal{O}(\log^D N)$ is sufficient. Our results show that as datasets grow, Gaussian process posteriors can truly be approximated cheaply, and provide a concrete rule for how to increase $M$ in continual learning scenarios.