ML · LG · Jun 1

Scalable Derivative Gaussian Processes via Exact Gradient Reduction

arXiv:2606.02909 · 11.2

Predicted impact: top 47% in ML (last 90 days) · Originality: highly original

AI Analysis

For practitioners who fit Gaussian processes to gradient observations in high-dimensional settings, this work offers a practical way around the cubic cost of exact inference in the joint state size.

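The paper introduces TERA, a scalable derivative Gaussian process method. It replaces exact inference, which costs O(n^3 d^3), with a per-target evaluation that costs O(d m^2 + m^6) time and O(d m^2 + m^4) memory for a conditioning set of size m. The reduction exploits the conditional independence of the gradient components orthogonal to the directions connecting the target and conditioning points. The authors report state-of-the-art predictive accuracy with orders-of-magnitude speedups, and computation time and peak GPU memory that stay essentially flat as the dimension d grows.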

Gradient observations can substantially improve Gaussian process (GP) surrogates, particularly in high-dimensional settings where function evaluations are expensive. However, exact inference with $n$ function values and $n$ full gradients in $d$ dimensions scales cubically in the joint state size, imposing an intractable $\mathcal{O}(n^3 d^3)$ computational bottleneck. We introduce TERA, a highly scalable derivative GP method based on target-specific exact gradient reduction. We prove that for stationary kernels, the gradient components orthogonal to the directions connecting the target and conditioning points are conditionally independent of the target function value; consequently, the exact conditional density is fully characterized by at most $m^2$ directional derivatives once a conditioning set of size $m$ is specified. By using these reduced, dimension-free conditionals as local factors in a Vecchia approximation, TERA effectively decouples $n$ and $d$ from the dense matrix inversion. This reduces the per-target evaluation cost to $\mathcal{O}(dm^2 + m^6)$ time and $\mathcal{O}(dm^2 + m^4)$ memory, leaving the underlying derivative GP model mathematically unchanged. Empirical evaluations demonstrate that TERA achieves state-of-the-art predictive accuracy while operating orders of magnitude faster than standard derivative GPs. Crucially, both computation time and peak GPU memory remain essentially flat with respect to $d$, enabling highly scalable inference in high-dimensional spaces.
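The reduction claim in the abstract can be checked numerically on a small case. The sketch below is not the authors' implementation and includes no Vecchia approximation. It assumes an isotropic squared-exponential kernel with unit variance, noise-free observations, and a made-up test function. The helper names (`rbf_joint`, `posterior`) and all parameter values are hypothetical. It compares the exact posterior of $f(x_0)$ given $m$ function values and $m$ full gradients against the posterior given the same function values and only the gradients projected onto the span of the differences $x_i - x_0$, which amounts to at most $m^2$ directional derivatives.

```python
import numpy as np

def rbf_joint(X, x0, A, ell):
    """Covariance of [f(X), A grad f(X)] and its cross-covariance with f(x0).

    Assumes k(x, y) = exp(-||x - y||^2 / (2 ell^2)). A has shape (r, d);
    each row is a direction along which the gradient is observed.
    """
    m, d = X.shape
    r = A.shape[0]
    K = np.zeros((m * (1 + r), m * (1 + r)))
    c = np.zeros(m * (1 + r))
    for i in range(m):
        tau0 = x0 - X[i]
        k0 = np.exp(-tau0 @ tau0 / (2 * ell**2))
        c[i] = k0                                        # Cov(f(x0), f(x_i))
        c[m + i * r : m + (i + 1) * r] = A @ (k0 * tau0 / ell**2)
        for j in range(m):
            tau = X[i] - X[j]
            k = np.exp(-tau @ tau / (2 * ell**2))
            K[i, j] = k
            fg = A @ (k * tau / ell**2)                  # Cov(f(x_i), A grad f(x_j))
            K[i, m + j * r : m + (j + 1) * r] = fg
            K[m + j * r : m + (j + 1) * r, i] = fg
            H = k * (np.eye(d) / ell**2 - np.outer(tau, tau) / ell**4)
            K[m + i * r : m + (i + 1) * r, m + j * r : m + (j + 1) * r] = A @ H @ A.T
    return K, c

def posterior(X, x0, A, y, ell, jitter=1e-10):
    """Posterior mean and variance of f(x0) given the observation vector y."""
    K, c = rbf_joint(X, x0, A, ell)
    w = np.linalg.solve(K + jitter * np.eye(len(c)), c)
    return w @ y, 1.0 - c @ w

rng = np.random.default_rng(0)
d, m, ell = 30, 4, 3.0
X = 0.5 * rng.normal(size=(m, d))        # conditioning points
x0 = 0.5 * rng.normal(size=d)            # target point
a = rng.normal(size=d) / np.sqrt(d)
f = np.sin(X @ a)                        # made-up test function f(x) = sin(a . x)
G = np.cos(X @ a)[:, None] * a           # its gradients, shape (m, d)

# Orthonormal basis of span{x_i - x0}; every x_i - x_j lies in the same span.
Q, _ = np.linalg.qr((X - x0).T)          # shape (d, m)
A_full, A_red = np.eye(d), Q.T

y_full = np.concatenate([f, (G @ A_full.T).ravel()])   # m * (1 + d) observations
y_red = np.concatenate([f, (G @ A_red.T).ravel()])     # m + m * m observations

mean_full, var_full = posterior(X, x0, A_full, y_full, ell)
mean_red, var_red = posterior(X, x0, A_red, y_red, ell)
print(len(y_full), len(y_red))
print(abs(mean_full - mean_red), abs(var_full - var_red))
```

Under these assumptions the two posteriors should agree up to numerical error, while the linear system shrinks from $m(1+d)$ to $m + m^2$ unknowns. The abstract states the result for stationary kernels in general; this sketch only exercises the isotropic case, where the cross-covariance between $f(x_0)$ and $\nabla f(x_i)$ is parallel to $x_i - x_0$.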
