Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large $n$ & large $p$" Bayesian sparse regression
This addresses the problem of slow posterior inference in large-scale Bayesian regression for healthcare data analysis, representing an incremental improvement in computational efficiency.
The paper tackles the computational bottleneck of sampling from high-dimensional Gaussian distributions in Bayesian sparse regression for large datasets by introducing a prior-preconditioned conjugate gradient method, achieving an order of magnitude speed-up that reduces computation time from two weeks to less than a day in a clinical study with n=72,489 and p=22,175.
In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of $10^5$ ~ $10^6$ and of $10^4$ ~ $10^5$. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian methods based on shrinkage priors. In the "large n & large p" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix $Φ$ is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector $b$ such that the solution to the linear system $Φβ= b$ has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by $Φ$; this involves no explicit factorization or calculation of $Φ$ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with n = 72,489 patients and p = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in posterior inference, in our case cutting the computation time from two weeks to less than a day.