MLJul 27, 2023
Linear Convergence of Black-Box Variational Inference: Should We Stick the Landing?Kyurae Kim, Yian Ma, Jacob R. Gardner · deepmind
We prove that black-box variational inference (BBVI) with control variates, particularly the sticking-the-landing (STL) estimator, converges at a geometric (traditionally called "linear") rate under perfect variational family specification. In particular, we prove a quadratic bound on the gradient variance of the STL estimator, one which encompasses misspecified variational families. Combined with previous works on the quadratic variance condition, this directly implies convergence of BBVI with the use of projected stochastic gradient descent. For the projection operator, we consider a domain with triangular scale matrices, which the projection onto is computable in $Θ(d)$ time, where $d$ is the dimensionality of the target posterior. We also improve existing analysis on the regular closed-form entropy gradient estimators, which enables comparison against the STL estimator, providing explicit non-asymptotic complexity guarantees for both.
CLOct 10, 2022
Extracting or Guessing? Improving Faithfulness of Event Temporal Relation ExtractionHaoyu Wang, Hongming Zhang, Yuqian Deng et al.
In this paper, we seek to improve the faithfulness of TempRel extraction models from two perspectives. The first perspective is to extract genuinely based on contextual description. To achieve this, we propose to conduct counterfactual analysis to attenuate the effects of two significant types of training biases: the event trigger bias and the frequent label bias. We also add tense information into event representations to explicitly place an emphasis on the contextual description. The second perspective is to provide proper uncertainty estimation and abstain from extraction when no relation is described in the text. By parameterization of Dirichlet Prior over the model-predicted categorical distribution, we improve the model estimates of the correctness likelihood and make TempRel predictions more selective. We also employ temperature scaling to recalibrate the model confidence measure after bias mitigation. Through experimental analysis on MATRES, MATRES-DS, and TDDiscourse, we demonstrate that our model extracts TempRel and timelines more faithfully compared to SOTA methods, especially under distribution shifts.
LGOct 21, 2022
Local Bayesian optimization via maximizing probability of descentQuan Nguyen, Kaiwen Wu, Jacob R. Gardner et al.
Local optimization presents a promising approach to expensive, high-dimensional black-box optimization by sidestepping the need to globally explore the search space. For objective functions whose gradient cannot be evaluated directly, Bayesian optimization offers one solution -- we construct a probabilistic model of the objective, design a policy to learn about the gradient at the current location, and use the resulting information to navigate the objective landscape. Previous work has realized this scheme by minimizing the variance in the estimate of the gradient, then moving in the direction of the expected gradient. In this paper, we re-examine and refine this approach. We demonstrate that, surprisingly, the expected value of the gradient is not always the direction maximizing the probability of descent, and in fact, these directions may be nearly orthogonal. This observation then inspires an elegant optimization scheme seeking to maximize the probability of descent while moving in the direction of most-probable descent. Experiments on both synthetic and real-world objectives show that our method outperforms previous realizations of this optimization scheme and is competitive against other, significantly more complicated baselines.
LGOct 26, 2023
Large-Scale Gaussian Processes via Alternating ProjectionKaiwen Wu, Jonathan Wenger, Haydn Jones et al.
Training and inference in Gaussian processes (GPs) require solving linear systems with $n\times n$ kernel matrices. To address the prohibitive $\mathcal{O}(n^3)$ time complexity, recent work has employed fast iterative methods, like conjugate gradients (CG). However, as datasets increase in magnitude, the kernel matrices become increasingly ill-conditioned and still require $\mathcal{O}(n^2)$ space without partitioning. Thus, while CG increases the size of datasets GPs can be trained on, modern datasets reach scales beyond its applicability. In this work, we propose an iterative method which only accesses subblocks of the kernel matrix, effectively enabling mini-batching. Our algorithm, based on alternating projection, has $\mathcal{O}(n)$ per-iteration time and space complexity, solving many of the practical challenges of scaling GPs to very large datasets. Theoretically, we prove the method enjoys linear convergence. Empirically, we demonstrate its fast convergence in practice and robustness to ill-conditioning. On large-scale benchmark datasets with up to four million data points, our approach accelerates GP training and inference by speed-up factors up to $27\times$ and $72 \times$, respectively, compared to CG.
LGJun 13, 2022
Markov Chain Score Ascent: A Unifying Framework of Variational Inference with Markovian GradientsKyurae Kim, Jisu Oh, Jacob R. Gardner et al.
Minimizing the inclusive Kullback-Leibler (KL) divergence with stochastic gradient descent (SGD) is challenging since its gradient is defined as an integral over the posterior. Recently, multiple methods have been proposed to run SGD with biased gradient estimates obtained from a Markov chain. This paper provides the first non-asymptotic convergence analysis of these methods by establishing their mixing rate and gradient variance. To do this, we demonstrate that these methods-which we collectively refer to as Markov chain score ascent (MCSA) methods-can be cast as special cases of the Markov chain gradient descent framework. Furthermore, by leveraging this new understanding, we develop a novel MCSA scheme, parallel MCSA (pMCSA), that achieves a tighter bound on the gradient variance. We demonstrate that this improved theoretical result translates to superior empirical performance.
LGMar 18, 2023
Practical and Matching Gradient Variance Bounds for Black-Box Variational Bayesian InferenceKyurae Kim, Kaiwen Wu, Jisu Oh et al.
Understanding the gradient variance of black-box variational inference (BBVI) is a crucial step for establishing its convergence and developing algorithmic improvements. However, existing studies have yet to show that the gradient variance of BBVI satisfies the conditions used to study the convergence of stochastic gradient descent (SGD), the workhorse of BBVI. In this work, we show that BBVI satisfies a matching bound corresponding to the $ABC$ condition used in the SGD literature when applied to smooth and quadratically-growing log-likelihoods. Our results generalize to nonlinear covariance parameterizations widely used in the practice of BBVI. Furthermore, we show that the variance of the mean-field parameterization has provably superior dimensional dependence.
DBMar 2
Adversarial Query Synthesis via Bayesian OptimizationJeffrey Tao, Yimeng Zeng, Haydn Thomas Jones et al.
Benchmark workloads are extremely important to the database management research community, especially as more machine learning components are integrated into database systems. Here, we propose a Bayesian optimization technique to automatically search for difficult benchmark queries, significantly reducing the amount of manual effort usually required. In preliminary experiments, we show that our approach can generate queries with more than double the optimization headroom compared to existing benchmarks.
LGJan 29
Purely Agentic Black-Box Optimization for Biological DesignNatalie Maus, Yimeng Zeng, Haydn Thomas Jones et al.
Many key challenges in biological design-such as small-molecule drug discovery, antimicrobial peptide development, and protein engineering-can be framed as black-box optimization over vast, complex structured spaces. Existing methods rely mainly on raw structural data and struggle to exploit the rich scientific literature. While large language models (LLMs) have been added to these pipelines, they have been confined to narrow roles within structure-centered optimizers. We instead cast biological black-box optimization as a fully agentic, language-based reasoning process. We introduce Purely Agentic BLack-box Optimization (PABLO), a hierarchical agentic system that uses scientific LLMs pretrained on chemistry and biology literature to generate and iteratively refine biological candidates. On both the standard GuacaMol molecular design and antimicrobial peptide optimization tasks, PABLO achieves state-of-the-art performance, substantially improving sample efficiency and final objective values over established baselines. Compared to prior optimization methods that incorporate LLMs, PABLO achieves competitive token usage per run despite relying on LLMs throughout the optimization loop. Beyond raw performance, the agentic formulation offers key advantages for realistic design: it naturally incorporates semantic task descriptions, retrieval-augmented domain knowledge, and complex constraints. In follow-up in vitro validation, PABLO-optimized peptides showed strong activity against drug-resistant pathogens, underscoring the practical potential of PABLO for therapeutic discovery.
98.4LGMay 7Code
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at ScaleHaydn Jones, Yimeng Zeng, Alden Rose et al.
Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
LGJul 15, 2024
A Fast, Robust Elliptical Slice Sampling Implementation for Linearly Truncated Multivariate Normal DistributionsKaiwen Wu, Jacob R. Gardner
Elliptical slice sampling, when adapted to linearly truncated multivariate normal distributions, is a rejection-free Markov chain Monte Carlo method. At its core, it requires analytically constructing an ellipse-polytope intersection. The main novelty of this paper is an algorithm that computes this intersection in $\mathcal{O}(m \log m)$ time, where $m$ is the number of linear inequality constraints representing the polytope. We show that an implementation based on this algorithm enhances numerical stability, speeds up running time, and is easy to parallelize for launching multiple Markov chains.
LGJan 29
Knowledge Gradient for Preference LearningKaiwen Wu, Jacob R. Gardner
The knowledge gradient is a popular acquisition function in Bayesian optimization (BO) for optimizing black-box objectives with noisy function evaluations. Many practical settings, however, allow only pairwise comparison queries, yielding a preferential BO problem where direct function evaluations are unavailable. Extending the knowledge gradient to preferential BO is hindered by its computational challenge. At its core, the look-ahead step in the preferential setting requires computing a non-Gaussian posterior, which was previously considered intractable. In this paper, we address this challenge by deriving an exact and analytical knowledge gradient for preferential BO. We show that the exact knowledge gradient performs strongly on a suite of benchmark problems, often outperforming existing acquisition functions. In addition, we also present a case study illustrating the limitation of the knowledge gradient in certain scenarios.
LGAug 14, 2025Code
A Dataset for Distilling Knowledge Priors from Literature for Therapeutic DesignHaydn Thomas Jones, Natalie Maus, Josh Magnus Ludan et al.
AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60\% of molecules proposed had high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines for discovering therapeutic entities in relevant paragraphs and summarizing information in concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts, and appropriate entity representations (i.e. SMILES or refseq IDs). To demonstrate the potential of the data, we train LLM, CLIP, and LLava architectures to reason jointly about text and design targets and evaluate on tasks from the Therapeutic Data Commons (TDC). Medex is highly effective for creating models with strong priors: in supervised prediction problems that use our data as pretraining, our best models with 15M learnable parameters outperform larger 2B TxGemma on both regression and classification TDC tasks, and perform comparably to 9B models on average. Models built with Medex can be used as constraints while optimizing for novel molecules in GuacaMol, resulting in proposals that are safer and nearly as effective. We release our dataset at https://huggingface.co/datasets/medexanon/Medex, and will provide expanded versions as available literature grows.
LGJan 31, 2025Code
Covering Multiple Objectives with a Small Set of Solutions Using Bayesian OptimizationNatalie Maus, Kyurae Kim, Yimeng Zeng et al.
In multi-objective black-box optimization, the goal is typically to find solutions that optimize a set of $T$ black-box objective functions, $f_1, \ldots f_T$, simultaneously. Traditional approaches often seek a single Pareto-optimal set that balances trade-offs among all objectives. In contrast, we consider a problem setting that departs from this paradigm: finding a small set of $K < T$ solutions, that collectively "cover" the $T$ objectives. A set of solutions is defined as "covering" if, for each objective $f_1, \ldots f_T$, there is at least one good solution. A motivating example for this problem setting occurs in drug design. For example, we may have $T$ pathogens and aim to identify a set of $K < T$ antibiotics such that at least one antibiotic can be used to treat each pathogen. This problem, known as coverage optimization, has yet to be tackled with the Bayesian optimization (BO) framework. To fill this void, we develop Multi-Objective Coverage Bayesian Optimization (MOCOBO), a BO algorithm for solving coverage optimization. Our approach is based on a new acquisition function reminiscent of expected improvement in the vanilla BO setup. We demonstrate the performance of our method on high-dimensional black-box optimization tasks, including applications in peptide and molecular design. Results show that the coverage of the $K < T$ solutions found by MOCOBO matches or nearly matches the coverage of $T$ solutions obtained by optimizing each objective individually. Furthermore, in in vitro experiments, the peptides found by MOCOBO exhibited high potency against drug-resistant pathogens, further demonstrating the potential of MOCOBO for drug discovery. All of our code is publicly available at the following link: https://github.com/nataliemaus/mocobo.
MLSep 6, 2014Code
A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU ComputingQuan Zhou, Wenlin Chen, Shiji Song et al.
The past years have witnessed many dedicated open-source projects that built and maintain implementations of Support Vector Machines (SVM), parallelized for GPU, multi-core CPUs and distributed systems. Up to this point, no comparable effort has been made to parallelize the Elastic Net, despite its popularity in many high impact applications, including genetics, neuroscience and systems biology. The first contribution in this paper is of theoretical nature. We establish a tight link between two seemingly different algorithms and prove that Elastic Net regression can be reduced to SVM with squared hinge loss classification. Our second contribution is to derive a practical algorithm based on this reduction. The reduction enables us to utilize prior efforts in speeding up and parallelizing SVMs to obtain a highly optimized and parallel solver for the Elastic Net and Lasso. With a simple wrapper, consisting of only 11 lines of MATLAB code, we obtain an Elastic Net implementation that naturally utilizes GPU and multi-core CPUs. We demonstrate on twelve real world data sets, that our algorithm yields identical results as the popular (and highly optimized) glmnet implementation but is one or several orders of magnitude faster.
LGNov 1, 2024
Computation-Aware Gaussian Processes: Model Selection And Linear-Time InferenceJonathan Wenger, Kaiwen Wu, Philipp Hennig et al.
Model selection in Gaussian processes scales prohibitively with the size of the training dataset, both in time and memory. While many approximations exist, all incur inevitable approximation error. Recent work accounts for this error in the form of computational uncertainty, which enables -- at the cost of quadratic complexity -- an explicit tradeoff between computation and precision. Here we extend this development to model selection, which requires significant enhancements to the existing approach, including linear-time scaling in the size of the dataset. We propose a novel training loss for hyperparameter optimization and demonstrate empirically that the resulting method can outperform SGPR, CGGP and SVGP, state-of-the-art methods for GP model selection, on medium to large-scale datasets. Our experiments show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU. As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty -- a fundamental prerequisite for optimal decision-making.
COFeb 27, 2024
Stochastic Approximation with Biased MCMC for Expectation MaximizationSamuel Gruffaz, Kyurae Kim, Alain Oliviero Durmus et al.
The expectation maximization (EM) algorithm is a widespread method for empirical Bayesian inference, but its expectation step (E-step) is often intractable. Employing a stochastic approximation scheme with Markov chain Monte Carlo (MCMC) can circumvent this issue, resulting in an algorithm known as MCMC-SAEM. While theoretical guarantees for MCMC-SAEM have previously been established, these results are restricted to the case where asymptotically unbiased MCMC algorithms are used. In practice, MCMC-SAEM is often run with asymptotically biased MCMC, for which the consequences are theoretically less understood. In this work, we fill this gap by analyzing the asymptotics and non-asymptotics of SAEM with biased MCMC steps, particularly the effect of bias. We also provide numerical experiments comparing the Metropolis-adjusted Langevin algorithm (MALA), which is asymptotically unbiased, and the unadjusted Langevin algorithm (ULA), which is asymptotically biased, on synthetic and real datasets. Experimental results show that ULA is more stable with respect to the choice of Langevin stepsize and can sometimes result in faster convergence.
MLMar 19, 2025
Tuning Sequential Monte Carlo Samplers via Greedy Incremental Divergence MinimizationKyurae Kim, Zuheng Xu, Jacob R. Gardner et al.
The performance of sequential Monte Carlo (SMC) samplers heavily depends on the tuning of the Markov kernels used in the path proposal. For SMC samplers with unadjusted Markov kernels, standard tuning objectives, such as the Metropolis-Hastings acceptance rate or the expected-squared jump distance, are no longer applicable. While stochastic gradient-based end-to-end optimization has been explored for tuning SMC samplers, they often incur excessive training costs, even for tuning just the kernel step sizes. In this work, we propose a general adaptation framework for tuning the Markov kernels in SMC samplers by minimizing the incremental Kullback-Leibler (KL) divergence between the proposal and target paths. For step size tuning, we provide a gradient- and tuning-free algorithm that is generally applicable for kernels such as Langevin Monte Carlo (LMC). We further demonstrate the utility of our approach by providing a tailored scheme for tuning kinetic LMC used in SMC samplers. Our implementations are able to obtain a full schedule of tuned parameters at the cost of a few vanilla SMC runs, which is a fraction of gradient-based approaches.
MLFeb 21
Stochastic Gradient Variational Inference with Price's Gradient Estimator from Bures-Wasserstein to Parameter SpaceKyurae Kim, Qiang Fu, Yi-An Ma et al.
For approximating a target distribution given only its unnormalized log-density, stochastic gradient-based variational inference (VI) algorithms are a popular approach. For example, Wasserstein VI (WVI) and black-box VI (BBVI) perform gradient descent in measure space (Bures-Wasserstein space) and parameter space, respectively. Previously, for the Gaussian variational family, convergence guarantees for WVI have shown superiority over existing results for black-box VI with the reparametrization gradient, suggesting the measure space approach might provide some unique benefits. In this work, however, we close this gap by obtaining identical state-of-the-art iteration complexity guarantees for both. In particular, we identify that WVI's superiority stems from the specific gradient estimator it uses, which BBVI can also leverage with minor modifications. The estimator in question is usually associated with Price's theorem and utilizes second-order information (Hessians) of the target log-density. We will refer to this as Price's gradient. On the flip side, WVI can be made more widely applicable by using the reparametrization gradient, which requires only gradients of the log-density. We empirically demonstrate that the use of Price's gradient is the major source of performance improvement.
LGNov 28, 2025
We Still Don't Understand High-Dimensional Bayesian OptimizationColin Doumont, Donney Fan, Natalie Maus et al.
High-dimensional spaces have challenged Bayesian optimization (BO). Existing methods aim to overcome this so-called curse of dimensionality by carefully encoding structural assumptions, from locality to sparsity to smoothness, into the optimization procedure. Surprisingly, we demonstrate that these approaches are outperformed by arguably the simplest method imaginable: Bayesian linear regression. After applying a geometric transformation to avoid boundary-seeking behavior, Gaussian processes with linear kernels match state-of-the-art performance on tasks with 60- to 6,000-dimensional search spaces. Linear models offer numerous advantages over their non-parametric counterparts: they afford closed-form sampling and their computation scales linearly with data, a fact we exploit on molecular optimization tasks with > 20,000 observations. Coupled with empirical analyses, our results suggest the need to depart from past intuitions about BO methods in high-dimensional spaces.
MLMay 27, 2025
Nearly Dimension-Independent Convergence of Mean-Field Black-Box Variational InferenceKyurae Kim, Yi-An Ma, Trevor Campbell et al.
We prove that, given a mean-field location-scale variational family, black-box variational inference (BBVI) with the reparametrization gradient converges at a rate that is nearly independent of explicit dimension dependence. Specifically, for a $d$-dimensional strongly log-concave and log-smooth target, the number of iterations for BBVI with a sub-Gaussian family to obtain a solution $ε$-close to the global optimum has a dimension dependence of $\mathrm{O}(\log d)$. This is a significant improvement over the $\mathrm{O}(d)$ dependence of full-rank location-scale families. For heavy-tailed families, we prove a weaker $\mathrm{O}(d^{2/k})$ dependence, where $k$ is the number of finite moments of the family. Additionally, if the Hessian of the target log-density is constant, the complexity is free of any explicit dimension dependence. We also prove that our bound on the gradient variance, which is key to our result, cannot be improved using only spectral bounds on the Hessian of the target log-density.
LGMar 11, 2025
Large Scale Multi-Task Bayesian Optimization with Large Language ModelsYimeng Zeng, Natalie Maus, Haydn Thomas Jones et al.
In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling beyond a moderate number of tasks. We introduce a novel approach leveraging large language models (LLMs) to learn from, and improve upon, previous optimization trajectories, scaling to approximately 1500 distinct tasks. Specifically, we propose a feedback loop in which an LLM is fine-tuned on the high quality solutions to specific tasks found by Bayesian optimization (BO). This LLM is then used to generate initialization points for future BO searches for new tasks. The trajectories of these new searches provide additional training data for fine-tuning the LLM, completing the loop. We evaluate our method on two distinct domains: database query optimization and antimicrobial peptide design. Results demonstrate that our approach creates a positive feedback loop, where the LLM's generated initializations gradually improve, leading to better optimization performance. As this feedback loop continues, we find that the LLM is eventually able to generate solutions to new tasks in just a few shots that are better than the solutions produced by "from scratch" by Bayesian optimization while simultaneously requiring significantly fewer oracle calls.
LGJun 6, 2024
Approximation-Aware Bayesian OptimizationNatalie Maus, Kyurae Kim, Geoff Pleiss et al.
High-dimensional Bayesian optimization (BO) tasks such as molecular design often require 10,000 function evaluations before obtaining meaningful results. While methods like sparse variational Gaussian processes (SVGPs) reduce computational requirements in these settings, the underlying approximations result in suboptimal data acquisitions that slow the progress of optimization. In this paper we modify SVGPs to better align with the goals of BO: targeting informed data acquisition rather than global posterior fidelity. Using the framework of utility-calibrated variational inference, we unify GP approximation and data acquisition into a joint optimization problem, thereby ensuring optimal decisions under a limited computational budget. Our approach can be used with any decision-theoretic acquisition function and is compatible with trust region methods like TuRBO. We derive efficient joint objectives for the expected improvement and knowledge gradient acquisition functions in both the standard and batch BO settings. Our approach outperforms standard SVGPs on high-dimensional benchmark tasks in control and molecular design.
LGJun 5, 2024
Zeroth-Order Fine-Tuning of LLMs with Extreme SparsityWentao Guo, Jikai Long, Yimeng Zeng et al.
Zeroth-order optimization (ZO) is a memory-efficient strategy for fine-tuning Large Language Models using only forward passes. However, the application of ZO fine-tuning in memory-constrained settings such as mobile phones and laptops is still challenging since full precision forward passes are infeasible. In this study, we address this limitation by integrating sparsity and quantization into ZO fine-tuning of LLMs. Specifically, we investigate the feasibility of fine-tuning an extremely small subset of LLM parameters using ZO. This approach allows the majority of un-tuned parameters to be quantized to accommodate the constraint of limited device memory. Our findings reveal that the pre-training process can identify a set of "sensitive parameters" that can guide the ZO fine-tuning of LLMs on downstream tasks. Our results demonstrate that fine-tuning 0.1% sensitive parameters in the LLM with ZO can outperform the full ZO fine-tuning performance, while offering wall-clock time speedup. Additionally, we show that ZO fine-tuning targeting these 0.1% sensitive parameters, combined with 4 bit quantization, enables efficient ZO fine-tuning of an Llama2-7B model on a GPU device with less than 8 GiB of memory and notably reduced latency.
LGJun 4, 2024
Understanding Stochastic Natural Gradient Variational InferenceKaiwen Wu, Jacob R. Gardner
Stochastic natural gradient variational inference (NGVI) is a popular posterior inference method with applications in various probabilistic models. Despite its wide usage, little is known about the non-asymptotic convergence rate in the \emph{stochastic} setting. We aim to lessen this gap and provide a better understanding. For conjugate likelihoods, we prove the first $\mathcal{O}(\frac{1}{T})$ non-asymptotic convergence rate of stochastic NGVI. The complexity is no worse than stochastic gradient descent (\aka black-box variational inference) and the rate likely has better constant dependency that leads to faster convergence in practice. For non-conjugate likelihoods, we show that stochastic NGVI with the canonical parameterization implicitly optimizes a non-convex objective. Thus, a global convergence rate of $\mathcal{O}(\frac{1}{T})$ is unlikely without some significant new understanding of optimizing the ELBO using natural gradients.
MLJun 3, 2024
Demystifying SGD with Doubly Stochastic GradientsKyurae Kim, Joohwan Ko, Yi-An Ma et al.
Optimization objectives in the form of a sum of intractable expectations are rising in importance (e.g., diffusion models, variational autoencoders, and many more), a setting also known as "finite sum with infinite data." For these problems, a popular strategy is to employ SGD with doubly stochastic gradients (doubly SGD): the expectations are estimated using the gradient estimator of each component, while the sum is estimated by subsampling over these estimators. Despite its popularity, little is known about the convergence properties of doubly SGD, except under strong assumptions such as bounded variance. In this work, we establish the convergence of doubly SGD with independent minibatching and random reshuffling under general conditions, which encompasses dependent component gradient estimators. In particular, for dependent estimators, our analysis allows fined-grained analysis of the effect correlations. As a result, under a per-iteration computational budget of $b \times m$, where $b$ is the minibatch size and $m$ is the number of Monte Carlo samples, our analysis suggests where one should invest most of the budget in general. Furthermore, we prove that random reshuffling (RR) improves the complexity dependence on the subsampling noise.
MLJan 19, 2024
Provably Scalable Black-Box Variational Inference with Structured Variational FamiliesJoohwan Ko, Kyurae Kim, Woo Chang Kim et al.
Variational families with full-rank covariance approximations are known not to work well in black-box variational inference (BBVI), both empirically and theoretically. In fact, recent computational complexity results for BBVI have established that full-rank variational families scale poorly with the dimensionality of the problem compared to e.g. mean-field families. This is particularly critical to hierarchical Bayesian models with local variables; their dimensionality increases with the size of the datasets. Consequently, one gets an iteration complexity with an explicit $\mathcal{O}(N^2)$ dependence on the dataset size $N$. In this paper, we explore a theoretical middle ground between mean-field variational families and full-rank families: structured variational families. We rigorously prove that certain scale matrix structures can achieve a better iteration complexity of $\mathcal{O}\left(N\right)$, implying better scaling with respect to $N$. We empirically verify our theoretical results on large-scale hierarchical models.
BMMay 25, 2023
Inverse Protein Folding Using Deep Bayesian OptimizationNatalie Maus, Yimeng Zeng, Daniel Allen Anderson et al.
Inverse protein folding -- the task of predicting a protein sequence from its backbone atom coordinates -- has surfaced as an important problem in the "top down", de novo design of proteins. Contemporary approaches have cast this problem as a conditional generative modelling problem, where a large generative model over protein sequences is conditioned on the backbone. While these generative models very rapidly produce promising sequences, independent draws from generative models may fail to produce sequences that reliably fold to the correct backbone. Furthermore, it is challenging to adapt pure generative approaches to other settings, e.g., when constraints exist. In this paper, we cast the problem of improving generated inverse folds as an optimization problem that we solve using recent advances in "deep" or "latent space" Bayesian optimization. Our approach consistently produces protein sequences with greatly reduced structural error to the target backbone structure as measured by TM score and RMSD while using fewer computational resources. Additionally, we demonstrate other advantages of an optimization-based approach to the problem, such as the ability to handle constraints.
LGMay 24, 2023
The Behavior and Convergence of Local Bayesian OptimizationKaiwen Wu, Kyurae Kim, Roman Garnett et al.
A recent development in Bayesian optimization is the use of local optimization strategies, which can deliver strong empirical performance on high-dimensional problems compared to traditional global strategies. The "folk wisdom" in the literature is that the focus on local optimization sidesteps the curse of dimensionality; however, little is known concretely about the expected behavior or convergence of Bayesian local optimization routines. We first study the behavior of the local approach, and find that the statistics of individual local solutions of Gaussian process sample paths are surprisingly good compared to what we would expect to recover from global methods. We then present the first rigorous analysis of such a Bayesian local optimization algorithm recently proposed by Müller et al. (2021), and derive convergence rates in both the noisy and noiseless settings.
LGMay 24, 2023
On the Convergence of Black-Box Variational InferenceKyurae Kim, Jisu Oh, Kaiwen Wu et al.
We provide the first convergence guarantee for full black-box variational inference (BBVI), also known as Monte Carlo variational inference. While preliminary investigations worked on simplified versions of BBVI (e.g., bounded domain, bounded support, only optimizing for the scale, and such), our setup does not need any such algorithmic modifications. Our results hold for log-smooth posterior densities with and without strong log-concavity and the location-scale variational family. Also, our analysis reveals that certain algorithm design choices commonly employed in practice, particularly, nonlinear parameterizations of the scale of the variational approximation, can result in suboptimal convergence rates. Fortunately, running BBVI with proximal stochastic gradient descent fixes these limitations, and thus achieves the strongest known convergence rate guarantees. We evaluate this theoretical insight by comparing proximal SGD against other standard implementations of BBVI on large-scale Bayesian inference problems.
LGJan 28, 2022
Local Latent Space Bayesian Optimization over Structured InputsNatalie Maus, Haydn T. Jones, Juston S. Moore et al.
Bayesian optimization over the latent spaces of deep autoencoder models (DAEs) has recently emerged as a promising new approach for optimizing challenging black-box functions over structured, discrete, hard-to-enumerate search spaces (e.g., molecules). Here the DAE dramatically simplifies the search space by mapping inputs into a continuous latent space where familiar Bayesian optimization tools can be more readily applied. Despite this simplification, the latent space typically remains high-dimensional. Thus, even with a well-suited latent space, these approaches do not necessarily provide a complete solution, but may rather shift the structured optimization problem to a high-dimensional one. In this paper, we propose LOL-BO, which adapts the notion of trust regions explored in recent work on high-dimensional Bayesian optimization to the structured setting. By reformulating the encoder to function as both an encoder for the DAE globally and as a deep kernel for the surrogate model within a trust region, we better align the notion of local optimization in the latent space with local optimization in the input space. LOL-BO achieves as much as 20 times improvement over state-of-the-art latent space Bayesian optimization methods across six real-world benchmarks, demonstrating that improvement in optimization strategies is as important as developing better DAE models.
LGJul 8, 2021
Scaling Gaussian Processes with Derivative Information Using Variational InferenceMisha Padidar, Xinran Zhu, Leo Huang et al.
Gaussian processes with derivative information are useful in many settings where derivative information is available, including numerous Bayesian optimization and regression tasks that arise in the natural sciences. Incorporating derivative observations, however, comes with a dominating $O(N^3D^3)$ computational cost when training on $N$ points in $D$ input dimensions. This is intractable for even moderately sized problems. While recent work has addressed this intractability in the low-$D$ setting, the high-$N$, high-$D$ setting is still unexplored and of great value, particularly as machine learning problems increasingly become high dimensional. In this paper, we introduce methods to achieve fully scalable Gaussian process regression with derivatives using variational inference. Analogous to the use of inducing values to sparsify the labels of a training set, we introduce the concept of inducing directional derivatives to sparsify the partial derivative information of a training set. This enables us to construct a variational posterior that incorporates derivative information but whose size depends neither on the full dataset size $N$ nor the full dimensionality $D$. We demonstrate the full scalability of our approach on a variety of tasks, ranging from a high dimensional stellarator fusion regression task to training graph convolutional neural networks on Pubmed using Bayesian optimization. Surprisingly, we find that our approach can improve regression performance even in settings where only label data is available.
LGJul 1, 2021
Preconditioning for Scalable Gaussian Process Hyperparameter OptimizationJonathan Wenger, Geoff Pleiss, Philipp Hennig et al.
Gaussian process hyperparameter optimization requires linear solves with, and log-determinants of, large kernel matrices. Iterative numerical techniques are becoming popular to scale to larger datasets, relying on the conjugate gradient method (CG) for the linear solves and stochastic trace estimation for the log-determinant. This work introduces new algorithmic and theoretical insights for preconditioning these computations. While preconditioning is well understood in the context of CG, we demonstrate that it can also accelerate convergence and reduce variance of the estimates for the log-determinant and its derivative. We prove general probabilistic error bounds for the preconditioned computation of the log-determinant, log-marginal likelihood and its derivatives. Additionally, we derive specific rates for a range of kernel-preconditioner combinations, showing that up to exponential convergence can be achieved. Our theoretical results enable provably efficient optimization of kernel hyperparameters, which we validate empirically on large-scale benchmark problems. There our approach accelerates training by up to an order of magnitude.
LGJun 29, 2020
Efficient Nonmyopic Bayesian Optimization via One-Shot Multi-Step TreesShali Jiang, Daniel R. Jiang, Maximilian Balandat et al.
Bayesian optimization is a sequential decision making framework for optimizing expensive-to-evaluate black-box functions. Computing a full lookahead policy amounts to solving a highly intractable stochastic dynamic program. Myopic approaches, such as expected improvement, are often adopted in practice, but they ignore the long-term impact of the immediate decision. Existing nonmyopic approaches are mostly heuristic and/or computationally expensive. In this paper, we provide the first efficient implementation of general multi-step lookahead Bayesian optimization, formulated as a sequence of nested optimization problems within a multi-step scenario tree. Instead of solving these problems in a nested way, we equivalently optimize all decision variables in the full tree jointly, in a ``one-shot'' fashion. Combining this with an efficient method for implementing multi-step Gaussian process ``fantasization,'' we demonstrate that multi-step expected improvement is computationally tractable and exhibits performance superior to existing methods on a wide range of benchmarks.
LGJun 19, 2020
Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian OptimizationGeoff Pleiss, Martin Jankowiak, David Eriksson et al.
Matrix square roots and their inverses arise frequently in machine learning, e.g., when sampling from high-dimensional Gaussians $\mathcal{N}(\mathbf 0, \mathbf K)$ or whitening a vector $\mathbf b$ against covariance matrix $\mathbf K$. While existing methods typically require $O(N^3)$ computation, we introduce a highly-efficient quadratic-time algorithm for computing $\mathbf K^{1/2} \mathbf b$, $\mathbf K^{-1/2} \mathbf b$, and their derivatives through matrix-vector multiplication (MVMs). Our method combines Krylov subspace methods with a rational approximation and typically achieves $4$ decimal places of accuracy with fewer than $100$ MVMs. Moreover, the backward pass requires little additional computation. We demonstrate our method's applicability on matrices as large as $50,\!000 \times 50,\!000$ - well beyond traditional methods - with little approximation error. Applying this increased scalability to variational Gaussian processes, Bayesian optimization, and Gibbs sampling results in more powerful models with higher accuracy.
MLFeb 21, 2020
Deep Sigma Point ProcessesMartin Jankowiak, Geoff Pleiss, Jacob R. Gardner
We introduce Deep Sigma Point Processes, a class of parametric models inspired by the compositional structure of Deep Gaussian Processes (DGPs). Deep Sigma Point Processes (DSPPs) retain many of the attractive features of (variational) DGPs, including mini-batch training and predictive uncertainty that is controlled by kernel basis functions. Importantly, since DSPPs admit a simple maximum likelihood inference procedure, the resulting predictive distributions are not degraded by any posterior approximations. In an extensive empirical comparison on univariate and multivariate regression tasks we find that the resulting predictive distributions are significantly better calibrated than those obtained with other probabilistic methods for scalable regression, including variational DGPs--often by as much as a nat per datapoint.
MLOct 16, 2019
Parametric Gaussian Process RegressorsMartin Jankowiak, Geoff Pleiss, Jacob R. Gardner
The combination of inducing point methods with stochastic variational inference has enabled approximate Gaussian Process (GP) inference on large datasets. Unfortunately, the resulting predictive distributions often exhibit substantially underestimated uncertainties. Notably, in the regression case the predictive variance is typically dominated by observation noise, yielding uncertainty estimates that make little use of the input-dependent function uncertainty that makes GP priors attractive. In this work we propose two simple methods for scalable GP regression that address this issue and thus yield substantially improved predictive uncertainties. The first applies variational inference to FITC (Fully Independent Training Conditional; Snelson et.~al.~2006). The second bypasses posterior approximations and instead directly targets the posterior predictive distribution. In an extensive empirical comparison with a number of alternative methods for scalable GP regression, we find that the resulting predictive distributions exhibit significantly better calibrated uncertainties and higher log likelihoods--often by as much as half a nat per datapoint.
LGMay 17, 2019
Simple Black-box Adversarial AttacksChuan Guo, Jacob R. Gardner, Yurong You et al.
We propose an intriguingly simple method for the construction of adversarial images in the black-box setting. In constrast to the white-box scenario, constructing black-box adversarial images has the additional constraint on query budget, and efficient attacks remain an open problem to date. With only the mild assumption of continuous-valued confidence scores, our highly query-efficient algorithm utilizes the following simple iterative principle: we randomly sample a vector from a predefined orthonormal basis and either add or subtract it to the target image. Despite its simplicity, the proposed method can be used for both untargeted and targeted attacks -- resulting in previously unprecedented query efficiency in both settings. We demonstrate the efficacy and efficiency of our algorithm on several real world settings including the Google Cloud Vision API. We argue that our proposed algorithm should serve as a strong baseline for future black-box attacks, in particular because it is extremely fast and its implementation requires less than 20 lines of PyTorch code.
LGMar 19, 2019
Exact Gaussian Processes on a Million Data PointsKe Alexander Wang, Geoff Pleiss, Jacob R. Gardner et al.
Gaussian processes (GPs) are flexible non-parametric models, with a capacity that grows with the available data. However, computational constraints with standard inference procedures have limited exact GPs to problems with fewer than about ten thousand training points, necessitating approximations for larger datasets. In this paper, we develop a scalable approach for exact GPs that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication. By partitioning and distributing kernel matrix multiplies, we demonstrate that an exact GP can be trained on over a million points, a task previously thought to be impossible with current computing hardware, in less than 2 hours. Moreover, our approach is generally applicable, without constraints to grid data or specific kernel classes. Enabled by this scalability, we perform the first-ever comparison of exact GPs against scalable GP approximations on datasets with $10^4 \!-\! 10^6$ data points, showing dramatic performance improvements.
LGSep 28, 2018
GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU AccelerationJacob R. Gardner, Geoff Pleiss, David Bindel et al.
Despite advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on developments in computing hardware. We present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified batched version of the conjugate gradients algorithm to derive all terms for training and inference in a single call. BBMM reduces the asymptotic complexity of exact GP inference from $O(n^3)$ to $O(n^2)$. Adapting this algorithm to scalable approximations and complex GP models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM uses a specialized preconditioner to substantially speed up convergence. In experiments we show that BBMM effectively uses GPU hardware to dramatically accelerate both exact GP inference and scalable approximations. Additionally, we provide GPyTorch, a software platform for scalable GP inference via BBMM, built on PyTorch.
LGMar 16, 2018
Constant-Time Predictive Distributions for Gaussian ProcessesGeoff Pleiss, Jacob R. Gardner, Kilian Q. Weinberger et al.
One of the most compelling features of Gaussian process (GP) regression is its ability to provide well-calibrated posterior distributions. Recent advances in inducing point methods have sped up GP marginal likelihood and posterior mean computations, leaving posterior covariance estimation and sampling as the remaining computational bottlenecks. In this paper we address these shortcomings by using the Lanczos algorithm to rapidly approximate the predictive covariance matrix. Our approach, which we refer to as LOVE (LanczOs Variance Estimates), substantially improves time and space complexity. In our experiments, LOVE computes covariances up to 2,000 times faster and draws samples 18,000 times faster than existing methods, all without sacrificing accuracy.
LGFeb 24, 2018
Product Kernel Interpolation for Scalable Gaussian ProcessesJacob R. Gardner, Geoff Pleiss, Ruihan Wu et al.
Recent work shows that inference for Gaussian processes can be performed efficiently using iterative methods that rely only on matrix-vector multiplications (MVMs). Structured Kernel Interpolation (SKI) exploits these techniques by deriving approximate kernels with very fast MVMs. Unfortunately, such strategies suffer badly from the curse of dimensionality. We develop a new technique for MVM based learning that exploits product kernel structure. We demonstrate that this technique is broadly applicable, resulting in linear rather than exponential runtime with dimension for SKI, as well as state-of-the-art asymptotic complexity for multi-task GPs.
LGNov 19, 2015
Deep Manifold Traversal: Changing Labels with Convolutional FeaturesJacob R. Gardner, Paul Upchurch, Matt J. Kusner et al.
Many tasks in computer vision can be cast as a "label changing" problem, where the goal is to make a semantic change to the appearance of an image or some subject in an image in order to alter the class membership. Although successful task-specific methods have been developed for some label changing applications, to date no general purpose method exists. Motivated by this we propose deep manifold traversal, a method that addresses the problem in its most general form: it first approximates the manifold of natural images then morphs a test image along a traversal path away from a source class and towards a target class while staying near the manifold throughout. The resulting algorithm is surprisingly effective and versatile. It is completely data driven, requiring only an example set of images from the desired source and target domains. We demonstrate deep manifold traversal on highly diverse label changing tasks: changing an individual's appearance (age and hair color), changing the season of an outdoor image, and transforming a city skyline towards nighttime.
LGJan 26, 2015
Compressed Support Vector MachinesZhixiang Xu, Jacob R. Gardner, Stephen Tyree et al.
Support vector machines (SVM) can classify data sets along highly non-linear decision boundaries because of the kernel-trick. This expressiveness comes at a price: During test-time, the SVM classifier needs to compute the kernel inner-product between a test sample and all support vectors. With large training data sets, the time required for this computation can be substantial. In this paper, we introduce a post-processing algorithm, which compresses the learned SVM model by reducing and optimizing support vectors. We evaluate our algorithm on several medium-scaled real-world data sets, demonstrating that it maintains high test accuracy while reducing the test-time evaluation cost by several orders of magnitude---in some cases from hours to seconds. It is fair to say that most of the work in this paper was previously been invented by Burges and Schölkopf almost 20 years ago. For most of the time during which we conducted this research, we were unaware of this prior work. However, in the past two decades, computing power has increased drastically, and we can therefore provide empirical insights that were not possible in their original paper.
MLJan 16, 2015
Differentially Private Bayesian OptimizationMatt J. Kusner, Jacob R. Gardner, Roman Garnett et al.
Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning. A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly private, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.
LGApr 3, 2014
Parallel Support Vector Machines in PracticeStephen Tyree, Jacob R. Gardner, Kilian Q. Weinberger et al.
In this paper, we evaluate the performance of various parallel optimization methods for Kernel Support Vector Machines on multicore CPUs and GPUs. In particular, we provide the first comparison of algorithms with explicit and implicit parallelization. Most existing parallel implementations for multi-core or GPU architectures are based on explicit parallelization of Sequential Minimal Optimization (SMO)---the programmers identified parallelizable components and hand-parallelized them, specifically tuned for a particular architecture. We compare these approaches with each other and with implicitly parallelized algorithms---where the algorithm is expressed such that most of the work is done within few iterations with large dense linear algebra operations. These can be computed with highly-optimized libraries, that are carefully parallelized for a large variety of parallel platforms. We highlight the advantages and disadvantages of both approaches and compare them on various benchmark data sets. We find an approximate implicitly parallel algorithm which is surprisingly efficient, permits a much simpler implementation, and leads to unprecedented speedups in SVM training.