LGMar 21, 2022
Preference Exploration for Efficient Bayesian Optimization with Multiple OutcomesZhiyuan Jerry Lin, Raul Astudillo, Peter I. Frazier et al. · gatech, stanford
We consider Bayesian optimization of expensive-to-evaluate experiments that generate vector-valued outcomes over which a decision-maker (DM) has preferences. These preferences are encoded by a utility function that is not known in closed form but can be estimated by asking the DM to express preferences over pairs of outcome vectors. To address this problem, we develop Bayesian optimization with preference exploration, a novel framework that alternates between interactive real-time preference learning with the DM via pairwise comparisons between outcomes, and Bayesian optimization with a learned compositional model of DM utility and outcomes. Within this framework, we propose preference exploration strategies specifically designed for this task, and demonstrate their performance via extensive simulation studies.
LGMar 28, 2023
qEUBO: A Decision-Theoretic Acquisition Function for Preferential Bayesian OptimizationRaul Astudillo, Zhiyuan Jerry Lin, Eytan Bakshy et al. · gatech, stanford
Preferential Bayesian optimization (PBO) is a framework for optimizing a decision maker's latent utility function using preference feedback. This work introduces the expected utility of the best option (qEUBO) as a novel acquisition function for PBO. When the decision maker's responses are noise-free, we show that qEUBO is one-step Bayes optimal and thus equivalent to the popular knowledge gradient acquisition function. We also show that qEUBO enjoys an additive constant approximation guarantee to the one-step Bayes-optimal policy when the decision maker's responses are corrupted by noise. We provide an extensive evaluation of qEUBO and demonstrate that it outperforms the state-of-the-art acquisition functions for PBO across many settings. Finally, we show that, under sufficient regularity conditions, qEUBO's Bayesian simple regret converges to zero at a rate $o(1/n)$ as the number of queries, $n$, goes to infinity. In contrast, we show that simple regret under qEI, a popular acquisition function for standard BO often used for PBO, can fail to converge to zero. Enjoying superior performance, simple computation, and a grounded decision-theoretic justification, qEUBO is a promising acquisition function for PBO.
LGJan 29, 2023
Smooth Non-Stationary BanditsSu Jia, Qian Xie, Nathan Kallus et al.
In many applications of online decision making, the environment is non-stationary and it is therefore crucial to use bandit algorithms that handle changes. Most existing approaches are designed to protect against non-smooth changes, constrained only by total variation or Lipschitzness over time. However, in practice, environments often change {\em smoothly}, so such algorithms may incur higher-than-necessary regret. We study a non-stationary bandits problem where each arm's mean reward sequence can be embedded into a $β$-Hölder function, i.e., a function that is $(β-1)$-times Lipschitz-continuously differentiable. The non-stationarity becomes more smooth as $β$ increases. When $β=1$, this corresponds to the non-smooth regime, where \cite{besbes2014stochastic} established a minimax regret of $\tilde Θ(T^{2/3})$. We show the first separation between the smooth (i.e., $β\ge 2$) and non-smooth (i.e., $β=1$) regimes by presenting a policy with $\tilde O(k^{4/5} T^{3/5})$ regret on any $k$-armed, $2$-Hölder instance. We complement this result by showing that the minimax regret on the $β$-Hölder family of instances is $Ω(T^{(β+1)/(2β+1)})$ for any integer $β\ge 1$. This matches our upper bound for $β=2$ up to logarithmic factors. Furthermore, we validated the effectiveness of our policy through a comprehensive numerical study using real-world click-through rate data.
MLNov 3, 2023
Bayesian Optimization of Function Networks with Partial EvaluationsPoompol Buathong, Jiayue Wan, Raul Astudillo et al.
Bayesian optimization is a powerful framework for optimizing functions that are expensive or time-consuming to evaluate. Recent work has considered Bayesian optimization of function networks (BOFN), where the objective function is given by a network of functions, each taking as input the output of previous nodes in the network as well as additional parameters. Leveraging this network structure has been shown to yield significant performance improvements. Existing BOFN algorithms for general-purpose networks evaluate the full network at each iteration. However, many real-world applications allow for evaluating nodes individually. To exploit this, we propose a novel knowledge gradient acquisition function that chooses which node and corresponding inputs to evaluate in a cost-aware manner, thereby reducing query costs by evaluating only on a part of the network at each step. We provide an efficient approach to optimizing our acquisition function and show that it outperforms existing BOFN methods and other benchmarks across several synthetic and real-world problems. Our acquisition function is the first to enable cost-aware optimization of a broad class of function networks.
MLFeb 16, 2016Code
Parallel Bayesian Global Optimization of Expensive FunctionsJialei Wang, Scott C. Clark, Eric Liu et al.
We consider parallel global optimization of derivative-free expensive-to-evaluate functions, and propose an efficient method based on stochastic approximation for implementing a conceptual Bayesian optimization algorithm proposed by Ginsbourger et al. (2007). At the heart of this algorithm is maximizing the information criterion called the "multi-points expected improvement'', or the q-EI. To accomplish this, we use infinitessimal perturbation analysis (IPA) to construct a stochastic gradient estimator and show that this estimator is unbiased. We also show that the stochastic gradient ascent algorithm using the constructed gradient estimator converges to a stationary point of the q-EI surface, and therefore, as the number of multiple starts of the gradient ascent algorithm and the number of steps for each start grow large, the one-step Bayes optimal set of points is recovered. We show in numerical experiments that our method for maximizing the q-EI is faster than methods based on closed-form evaluation using high-dimensional integration, when considering many parallel function evaluations, and is comparable in speed when considering few. We also show that the resulting one-step Bayes optimal algorithm for parallel global optimization finds high-quality solutions with fewer evaluations than a heuristic based on approximately maximizing the q-EI. A high-quality open source implementation of this algorithm is available in the open source Metrics Optimization Engine (MOE).
24.3LGMay 7
Better Protein Function Prediction by Modeling Survivorship BiasZhongmou Chao, Poompol Buathong, Ekaterina Selivanovitch et al.
Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non-functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive-unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly-observed protein variant in a well-surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non-functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well-surveilled single-organism sequence data. On three prediction tasks using single-organism uniform-coverage surveillance data -- predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants -- Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language models (PLMs). On prediction tasks from multi-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.
LGJul 16, 2025
Cost-aware Stopping for Bayesian OptimizationQian Xie, Linda Cai, Alexander Terenin et al.
In automated machine learning, scientific discovery, and other applications of Bayesian optimization, deciding when to stop evaluating expensive black-box functions is an important practical consideration. While several adaptive stopping rules have been proposed, in the cost-aware setting they lack guarantees ensuring they stop before incurring excessive function evaluation costs. We propose a cost-aware stopping rule for Bayesian optimization that adapts to varying evaluation costs and is free of heuristic tuning. Our rule is grounded in a theoretical connection to state-of-the-art cost-aware acquisition functions, namely the Pandora's Box Gittins Index (PBGI) and log expected improvement per cost. We prove a theoretical guarantee bounding the expected cumulative evaluation cost incurred by our stopping rule when paired with these two acquisition functions. In experiments on synthetic and empirical tasks, including hyperparameter optimization and neural architecture size search, we show that combining our stopping rule with the PBGI acquisition function usually matches or outperforms other acquisition-function--stopping-rule pairs in terms of cost-adjusted simple regret, a metric capturing trade-offs between solution quality and cumulative evaluation cost.
CLOct 29, 2025
LISTEN to Your Preferences: An LLM Framework for Multi-Objective SelectionAdam S. Jovine, Tinghan Ye, Francis Bahk et al.
Human experts often struggle to select the best option from a large set of items with multiple competing objectives, a process bottlenecked by the difficulty of formalizing complex, implicit preferences. To address this, we introduce LISTEN, a framework that leverages a Large Language Model (LLM) as a zero-shot preference oracle, guided only by an expert's high-level priorities in natural language. To operate within LLM constraints like context windows and inference costs, we propose two iterative algorithms: LISTEN-U, which uses the LLM to refine a parametric utility function, and LISTEN-T, a non-parametric method that performs tournament-style selections over small batches of solutions. Evaluated on diverse tasks including flight booking, shopping, and exam scheduling, our results show LISTEN-U excels when preferences are parametrically aligned (a property we measure with a novel concordance metric), while LISTEN-T offers more robust performance. This work explores a promising direction for steering complex multi-objective decisions directly with natural language, reducing the cognitive burden of traditional preference elicitation.
MLJun 13, 2025
Fast Bayesian Optimization of Function Networks with Partial EvaluationsPoompol Buathong, Peter I. Frazier
Bayesian optimization of function networks (BOFN) is a framework for optimizing expensive-to-evaluate objective functions structured as networks, where some nodes' outputs serve as inputs for others. Many real-world applications, such as manufacturing and drug discovery, involve function networks with additional properties - nodes that can be evaluated independently and incur varying costs. A recent BOFN variant, p-KGFN, leverages this structure and enables cost-aware partial evaluations, selectively querying only a subset of nodes at each iteration. p-KGFN reduces the number of expensive objective function evaluations needed but has a large computational overhead: choosing where to evaluate requires optimizing a nested Monte Carlo-based acquisition function for each node in the network. To address this, we propose an accelerated p-KGFN algorithm that reduces computational overhead with only a modest loss in query efficiency. Key to our approach is generation of node-specific candidate inputs for each node in the network via one inexpensive global Monte Carlo simulation. Numerical experiments show that our method maintains competitive query efficiency while achieving up to a 16x speedup over the original p-KGFN algorithm.
LGJun 28, 2024
Cost-aware Bayesian Optimization via the Pandora's Box Gittins IndexQian Xie, Raul Astudillo, Peter I. Frazier et al.
Bayesian optimization is a technique for efficiently optimizing unknown functions in a black-box manner. To handle practical settings where gathering data requires use of finite resources, it is desirable to explicitly incorporate function evaluation costs into Bayesian optimization policies. To understand how to do so, we develop a previously-unexplored connection between cost-aware Bayesian optimization and the Pandora's Box problem, a decision problem from economics. The Pandora's Box problem admits a Bayesian-optimal solution based on an expression called the Gittins index, which can be reinterpreted as an acquisition function. We study the use of this acquisition function for cost-aware Bayesian optimization, and demonstrate empirically that it performs well, particularly in medium-high dimensions. We further show that this performance carries over to classical Bayesian optimization without explicit evaluation costs. Our work constitutes a first step towards integrating techniques from Gittins index theory into Bayesian optimization.
LGJun 12, 2024
Asymptotically Optimal Regret for Black-Box Predict-then-OptimizeSamuel Tan, Peter I. Frazier
We consider the predict-then-optimize paradigm for decision-making in which a practitioner (1) trains a supervised learning model on historical data of decisions, contexts, and rewards, and then (2) uses the resulting model to make future binary decisions for new contexts by finding the decision that maximizes the model's predicted reward. This approach is common in industry. Past analysis assumes that rewards are observed for all actions for all historical contexts, which is possible only in problems with special structure. Motivated by problems from ads targeting and recommender systems, we study new black-box predict-then-optimize problems that lack this special structure and where we only observe the reward from the action taken. We present a novel loss function, which we call Empirical Soft Regret (ESR), designed to significantly improve reward when used in training compared to classical accuracy-based metrics like mean-squared error. This loss function targets the regret achieved when taking a suboptimal decision; because the regret is generally not differentiable, we propose a differentiable "soft" regret term that allows the use of neural networks and other flexible machine learning models dependent on gradient-based training. In the particular case of paired data, we show theoretically that optimizing our loss function yields asymptotically optimal regret within the class of supervised learning models. We also show our approach significantly outperforms state-of-the-art algorithms on real-world decision-making problems in news recommendation and personalized healthcare compared to benchmark methods from contextual bandits and conditional average treatment effect estimation.
LGMar 29, 2022
Near-optimality for infinite-horizon restless bandits with many armsXiangyu Zhang, Peter I. Frazier
Restless bandits are an important class of problems with applications in recommender systems, active learning, revenue management and other areas. We consider infinite-horizon discounted restless bandits with many arms where a fixed proportion of arms may be pulled in each period and where arms share a finite state space. Although an average-case-optimal policy can be computed via stochastic dynamic programming, the computation required grows exponentially with the number of arms $N$. Thus, it is important to find scalable policies that can be computed efficiently for large $N$ and that are near optimal in this regime, in the sense that the optimality gap (i.e. the loss of expected performance against an optimal policy) per arm vanishes for large $N$. However, the most popular approach, the Whittle index, requires a hard-to-verify indexability condition to be well-defined and another hard-to-verify condition to guarantee a $o(N)$ optimality gap. We present a method resolving these difficulties. By replacing a global Lagrange multiplier used by the Whittle index with a sequence of Lagrangian multipliers, one per time period up to a finite truncation point, we derive a class of policies, called fluid-balance policies, that have a $O(\sqrt{N})$ optimality gap. Unlike the Whittle index, fluid-balance policies do not require indexability to be well-defined and their $O(\sqrt{N})$ optimality gap bound holds universally without sufficient conditions. We also demonstrate empirically that fluid-balance policies provide state-of-the-art performance on specific problems.
LGJan 2, 2022
Thinking inside the box: A tutorial on grey-box Bayesian optimizationRaul Astudillo, Peter I. Frazier
Bayesian optimization (BO) is a framework for global optimization of expensive-to-evaluate objective functions. Classical BO methods assume that the objective function is a black box. However, internal information about objective function computation is often available. For example, when optimizing a manufacturing line's throughput with simulation, we observe the number of parts waiting at each workstation, in addition to the overall throughput. Recent BO methods leverage such internal information to dramatically improve performance. We call these "grey-box" BO methods because they treat objective computation as partially observable and even modifiable, blending the black-box approach with so-called "white-box" first-principles knowledge of objective function computation. This tutorial describes these methods, focusing on BO of composite objective functions, where one can observe and selectively evaluate individual constituents that feed into the overall objective; and multi-fidelity BO, where one can evaluate cheaper approximations of the objective function by varying parameters of the evaluation oracle.
LGDec 31, 2021
Bayesian Optimization of Function NetworksRaul Astudillo, Peter I. Frazier
We consider Bayesian optimization of the output of a network of functions, where each function takes as input the output of its parent nodes, and where the network takes significant time to evaluate. Such problems arise, for example, in reinforcement learning, engineering design, and manufacturing. While the standard Bayesian optimization approach observes only the final output, our approach delivers greater query efficiency by leveraging information that the former ignores: intermediate output within the network. This is achieved by modeling the nodes of the network using Gaussian processes and choosing the points to evaluate using, as our acquisition function, the expected improvement computed with respect to the implied posterior on the objective. Although the non-Gaussian nature of this posterior prevents computing our acquisition function in closed form, we show that it can be efficiently maximized via sample average approximation. In addition, we prove that our method is asymptotically consistent, meaning that it finds a globally optimal solution as the number of evaluations grows to infinity, thus generalizing previously known convergence results for the expected improvement. Notably, this holds even though our method might not evaluate the domain densely, instead leveraging problem structure to leave regions unexplored. Finally, we show that our approach dramatically outperforms standard Bayesian optimization methods in several synthetic and real-world problems.
MLDec 6, 2021
Two-step Lookahead Bayesian Optimization with Inequality ConstraintsYunxiang Zhang, Xiangyu Zhang, Peter I. Frazier
Recent advances in computationally efficient non-myopic Bayesian optimization (BO) improve query efficiency over traditional myopic methods like expected improvement while only modestly increasing computational cost. These advances have been largely limited, however, to unconstrained optimization. For constrained optimization, the few existing non-myopic BO methods require heavy computation. For instance, one existing non-myopic constrained BO method [Lam and Willcox, 2017] relies on computationally expensive unreliable brute-force derivative-free optimization of a Monte Carlo rollout acquisition function. Methods that use the reparameterization trick for more efficient derivative-based optimization of non-myopic acquisition functions in the unconstrained setting, like sample average approximation and infinitesimal perturbation analysis, do not extend: constraints introduce discontinuities in the sampled acquisition function surface that hinder its optimization. Moreover, we argue here that being non-myopic is even more important in constrained problems because fear of violating constraints pushes myopic methods away from sampling the boundary between feasible and infeasible regions, slowing the discovery of optimal solutions with tight constraints. In this paper, we propose a computationally efficient two-step lookahead constrained Bayesian optimization acquisition function (2-OPT-C) supporting both sequential and batch settings. To enable fast acquisition function optimization, we develop a novel likelihood-ratio-based unbiased estimator of the gradient of the two-step optimal acquisition function that does not use the reparameterization trick. In numerical experiments, 2-OPT-C typically improves query efficiency by 2x or more over previous methods, and in some cases by 10x or more.
LGNov 12, 2021
Multi-Step Budgeted Bayesian Optimization with Unknown Evaluation CostsRaul Astudillo, Daniel R. Jiang, Maximilian Balandat et al.
Bayesian optimization (BO) is a sample-efficient approach to optimizing costly-to-evaluate black-box functions. Most BO methods ignore how evaluation costs may vary over the optimization domain. However, these costs can be highly heterogeneous and are often unknown in advance. This occurs in many practical settings, such as hyperparameter tuning of machine learning algorithms or physics-based simulation optimization. Moreover, those few existing methods that acknowledge cost heterogeneity do not naturally accommodate a budget constraint on the total evaluation cost. This combination of unknown costs and a budget constraint introduces a new dimension to the exploration-exploitation trade-off, where learning about the cost incurs the cost itself. Existing methods do not reason about the various trade-offs of this problem in a principled way, leading often to poor performance. We formalize this claim by proving that the expected improvement and the expected improvement per unit of cost, arguably the two most widely used acquisition functions in practice, can be arbitrarily inferior with respect to the optimal non-myopic policy. To overcome the shortcomings of existing approaches, we propose the budgeted multi-step expected improvement, a non-myopic acquisition function that generalizes classical expected improvement to the setting of heterogeneous and unknown evaluation costs. Finally, we show that our acquisition function outperforms existing methods in a variety of synthetic and real problems.
OCJul 25, 2021
Restless Bandits with Many Arms: Beating the Central Limit TheoremXiangyu Zhang, Peter I. Frazier
We consider finite-horizon restless bandits with multiple pulls per period, which play an important role in recommender systems, active learning, revenue management, and many other areas. While an optimal policy can be computed, in principle, using dynamic programming, the computation required scales exponentially in the number of arms $N$. Thus, there is substantial value in understanding the performance of index policies and other policies that can be computed efficiently for large $N$. We study the growth of the optimality gap, i.e., the loss in expected performance compared to an optimal policy, for such policies in a classical asymptotic regime proposed by Whittle in which $N$ grows while holding constant the fraction of arms that can be pulled per period. Intuition from the Central Limit Theorem and the tightest previous theoretical bounds suggest that this optimality gap should grow like $O(\sqrt{N})$. Surprisingly, we show that it is possible to outperform this bound. We characterize a non-degeneracy condition and a wide class of novel practically-computable policies, called fluid-priority policies, in which the optimality gap is $O(1)$. These include most widely-used index policies. When this non-degeneracy condition does not hold, we show that fluid-priority policies nevertheless have an optimality gap that is $O(\sqrt{N})$, significantly generalizing the class of policies for which convergence rates are known. We demonstrate that fluid-priority policies offer state-of-the-art performance on a collection of restless bandit problems in numerical experiments.
MLNov 14, 2019
Multi-Attribute Bayesian Optimization With Interactive Preference LearningRaul Astudillo, Peter I. Frazier
We consider black-box global optimization of time-consuming-to-evaluate functions on behalf of a decision-maker (DM) whose preferences must be learned. Each feasible design is associated with a time-consuming-to-evaluate vector of attributes and each vector of attributes is assigned a utility by the DM's utility function, which may be learned approximately using preferences expressed over pairs of attribute vectors. Past work has used a point estimate of this utility function as if it were error-free within single-objective optimization. However, utility estimation errors may yield a poor suggested design. Furthermore, this approach produces a single suggested "best" design, whereas DMs often prefer to choose from a menu. We propose a novel multi-attribute Bayesian optimization with preference learning approach. Our approach acknowledges the uncertainty in preference estimation and implicitly chooses designs to evaluate that are good not just for a single estimated utility function but a range of likely ones. The outcome of our approach is a menu of designs and evaluated attributes from which the DM makes a final selection. We demonstrate the value and flexibility of our approach in a variety of experiments.
MLJun 4, 2019
Bayesian Optimization of Composite FunctionsRaul Astudillo, Peter I. Frazier
We consider optimization of composite objective functions, i.e., of the form $f(x)=g(h(x))$, where $h$ is a black-box derivative-free expensive-to-evaluate function with vector-valued outputs, and $g$ is a cheap-to-evaluate real-valued function. While these problems can be solved with standard Bayesian optimization, we propose a novel approach that exploits the composite structure of the objective function to substantially improve sampling efficiency. Our approach models $h$ using a multi-output Gaussian process and chooses where to sample using the expected improvement evaluated on the implied non-Gaussian posterior on $f$, which we call expected improvement for composite functions (\ei). Although \ei\ cannot be computed in closed form, we provide a novel stochastic gradient estimator that allows its efficient maximization. We also show that our approach is asymptotically consistent, i.e., that it recovers a globally optimal solution as sampling effort grows to infinity, generalizing previous convergence results for classical expected improvement. Numerical experiments show that our approach dramatically outperforms standard Bayesian optimization benchmarks, reducing simple regret by several orders of magnitude.
LGMar 12, 2019
Practical Multi-fidelity Bayesian Optimization for Hyperparameter TuningJian Wu, Saul Toscano-Palmerin, Peter I. Frazier et al.
Bayesian optimization is popular for optimizing time-consuming black-box objectives. Nonetheless, for hyperparameter tuning in deep neural networks, the time required to evaluate the validation error for even a few hyperparameter settings remains a bottleneck. Multi-fidelity optimization promises relief using cheaper proxies to such objectives --- for example, validation error for a network trained using a subset of the training points or fewer iterations than required for convergence. We propose a highly flexible and practical approach to multi-fidelity Bayesian optimization, focused on efficiently optimizing hyperparameters for iteratively trained supervised learning models. We introduce a new acquisition function, the trace-aware knowledge-gradient, which efficiently leverages both multiple continuous fidelity controls and trace observations --- values of the objective at a sequence of fidelities, available when varying fidelity using training iterations. We provide a provably convergent method for optimizing our acquisition function and show it outperforms state-of-the-art alternatives for hyperparameter tuning of deep neural networks and large-scale kernel learning.
MLJul 8, 2018
A Tutorial on Bayesian OptimizationPeter I. Frazier
Bayesian optimization is an approach to optimizing objective functions that take a long time (minutes or hours) to evaluate. It is best-suited for optimization over continuous domains of less than 20 dimensions, and tolerates stochastic noise in function evaluations. It builds a surrogate for the objective and quantifies the uncertainty in that surrogate using a Bayesian machine learning technique, Gaussian process regression, and then uses an acquisition function defined from this surrogate to decide where to sample. In this tutorial, we describe how Bayesian optimization works, including Gaussian process regression and three common acquisition functions: expected improvement, entropy search, and knowledge gradient. We then discuss more advanced techniques, including running multiple function evaluations in parallel, multi-fidelity and multi-information source optimization, expensive-to-evaluate constraints, random environmental conditions, multi-task Bayesian optimization, and the inclusion of derivative information. We conclude with a discussion of Bayesian optimization software and future research directions in the field. Within our tutorial material we provide a generalization of expected improvement to noisy evaluations, beyond the noise-free setting where it is more commonly applied. This generalization is justified by a formal decision-theoretic argument, standing in contrast to previous ad hoc modifications.
LGMar 23, 2018
Bayesian Optimization with Expensive IntegrandsSaul Toscano-Palmerin, Peter I. Frazier
We propose a Bayesian optimization algorithm for objective functions that are sums or integrals of expensive-to-evaluate functions, allowing noisy evaluations. These objective functions arise in multi-task Bayesian optimization for tuning machine learning hyperparameters, optimization via simulation, and sequential design of experiments with random environmental conditions. Our method is average-case optimal by construction when a single evaluation of the integrand remains within our evaluation budget. Achieving this one-step optimality requires solving a challenging value of information optimization problem, for which we provide a novel efficient discretization-free computational method. We also provide consistency proofs for our method in both continuum and discrete finite domains for objective functions that are sums. In numerical experiments comparing against previous state-of-the-art methods, including those that also leverage sum or integral structure, our method performs as well or better across a wide range of problems and offers significant improvements when evaluations are noisy or the integrand varies smoothly in the integrated variables.
MLJul 20, 2017
Discretization-free Knowledge Gradient Methods for Bayesian OptimizationJian Wu, Peter I. Frazier
This paper studies Bayesian ranking and selection (R&S) problems with correlated prior beliefs and continuous domains, i.e. Bayesian optimization (BO). Knowledge gradient methods [Frazier et al., 2008, 2009] have been widely studied for discrete R&S problems, which sample the one-step Bayes-optimal point. When used over continuous domains, previous work on the knowledge gradient [Scott et al., 2011, Wu and Frazier, 2016, Wu et al., 2017] often rely on a discretized finite approximation. However, the discretization introduces error and scales poorly as the dimension of domain grows. In this paper, we develop a fast discretization-free knowledge gradient method for Bayesian optimization. Our method is not restricted to the fully sequential setting, but useful in all settings where knowledge gradient can be used over continuous domains. We show how our method can be generalized to handle (i) batch of points suggestion (parallel knowledge gradient); (ii) the setting where derivative information is available in the optimization process (derivative-enabled knowledge gradient). In numerical experiments, we demonstrate that the discretization-free knowledge gradient method finds global optima significantly faster than previous Bayesian optimization algorithms on both synthetic test functions and real-world applications, especially when function evaluations are noisy; and derivative-enabled knowledge gradient can further improve the performances, even outperforming the gradient-based optimizer such as BFGS when derivative information is available.
LGJun 14, 2017
Dueling Bandits With Weak RegretBangrui Chen, Peter I. Frazier
We consider online content recommendation with implicit feedback through pairwise comparisons, formalized as the so-called dueling bandit problem. We study the dueling bandit problem in the Condorcet winner setting, and consider two notions of regret: the more well-studied strong regret, which is 0 only when both arms pulled are the Condorcet winner; and the less well-studied weak regret, which is 0 if either arm pulled is the Condorcet winner. We propose a new algorithm for this problem, Winner Stays (WS), with variations for each kind of regret: WS for weak regret (WS-W) has expected cumulative weak regret that is $O(N^2)$, and $O(N\log(N))$ if arms have a total order; WS for strong regret (WS-S) has expected cumulative strong regret of $O(N^2 + N \log(T))$, and $O(N\log(N)+N\log(T))$ if arms have a total order. WS-W is the first dueling bandit algorithm with weak regret that is constant in time. WS is simple to compute, even for problems with many arms, and we demonstrate through numerical experiments on simulated and real data that WS has significantly smaller regret than existing algorithms in both the weak- and strong-regret settings.
MLMar 13, 2017
Bayesian Optimization with GradientsJian Wu, Matthias Poloczek, Andrew Gordon Wilson et al.
Bayesian optimization has been successful at global optimization of expensive-to-evaluate multimodal objective functions. However, unlike most optimization methods, Bayesian optimization typically does not use derivative information. In this paper we show how Bayesian optimization can exploit derivative information to decrease the number of objective function evaluations required for good performance. In particular, we develop a novel Bayesian optimization algorithm, the derivative-enabled knowledge-gradient (dKG), for which we show one-step Bayes-optimality, asymptotic consistency, and greater one-step value of information than is possible in the derivative-free setting. Our procedure accommodates noisy and incomplete derivative information, comes in both sequential and batch forms, and can optionally reduce the computational cost of inference through automatically selected retention of a single directional derivative. We also compute the d-KG acquisition function and its gradient using a novel fast discretization-free technique. We show d-KG provides state-of-the-art performance compared to a wide range of optimization procedures with and without gradients, on benchmarks including logistic regression, deep learning, kernel learning, and k-nearest neighbors.
MLFeb 24, 2017
Bayes-Optimal Entropy Pursuit for Active Choice-Based Preference LearningStephen N. Pallone, Peter I. Frazier, Shane G. Henderson
We analyze the problem of learning a single user's preferences in an active learning setting, sequentially and adaptively querying the user over a finite time horizon. Learning is conducted via choice-based queries, where the user selects her preferred option among a small subset of offered alternatives. These queries have been shown to be a robust and efficient way to learn an individual's preferences. We take a parametric approach and model the user's preferences through a linear classifier, using a Bayesian prior to encode our current knowledge of this classifier. The rate at which we learn depends on the alternatives offered at every time epoch. Under certain noise assumptions, we show that the Bayes-optimal policy for maximally reducing entropy of the posterior distribution of this linear classifier is a greedy policy, and that this policy achieves a linear lower bound when alternatives can be constructed from the continuum. Further, we analyze a different metric called misclassification error, proving that the performance of the optimal policy that minimizes misclassification error is bounded below by a linear function of differential entropy. Lastly, we numerically compare the greedy entropy reduction policy with a knowledge gradient policy under a number of scenarios, examining their performance under both differential entropy and misclassification error.
PRDec 12, 2016
Probabilistic Bisection Converges Almost as Quickly as Stochastic ApproximationPeter I. Frazier, Shane G. Henderson, Rolf Waeber
The probabilistic bisection algorithm (PBA) solves a class of stochastic root-finding problems in one dimension by successively updating a prior belief on the location of the root based on noisy responses to queries at chosen points. The responses indicate the direction of the root from the queried point, and are incorrect with a fixed probability. The fixed-probability assumption is problematic in applications, and so we extend the PBA to apply when this assumption is relaxed. The extension involves the use of a power-one test at each queried point. We explore the convergence behavior of the extended PBA, showing that it converges at a rate arbitrarily close to, but slower than, the canonical "square root" rate of stochastic approximation.
MLAug 11, 2016
Warm Starting Bayesian OptimizationMatthias Poloczek, Jialei Wang, Peter I. Frazier
We develop a framework for warm-starting Bayesian optimization, that reduces the solution time required to solve an optimization problem that is one in a sequence of related problems. This is useful when optimizing the output of a stochastic simulator that fails to provide derivative information, for which Bayesian optimization methods are well-suited. Solving sequences of related optimization problems arises when making several business decisions using one optimization model and input data collected over different time periods or markets. While many gradient-based methods can be warm started by initiating optimization at the solution to the previous problem, this warm start approach does not apply to Bayesian optimization methods, which carry a full metamodel of the objective function from iteration to iteration. Our approach builds a joint statistical model of the entire collection of related objective functions, and uses a value of information calculation to recommend points to evaluate.
OCJul 11, 2016
Multi-Step Bayesian Optimization for One-Dimensional Feasibility DeterminationJ. Massey Cashore, Lemuel Kumarga, Peter I. Frazier
Bayesian optimization methods allocate limited sampling budgets to maximize expensive-to-evaluate functions. One-step-lookahead policies are often used, but computing optimal multi-step-lookahead policies remains a challenge. We consider a specialized Bayesian optimization problem: finding the superlevel set of an expensive one-dimensional function, with a Markov process prior. We compute the Bayes-optimal sampling policy efficiently, and characterize the suboptimality of one-step lookahead. Our numerical experiments demonstrate that the one-step lookahead policy is close to optimal in this problem, performing within 98% of optimal in the experimental settings considered.
MLJun 14, 2016
The Parallel Knowledge Gradient Method for Batch Bayesian OptimizationJian Wu, Peter I. Frazier
In many applications of black-box optimization, one can evaluate multiple points simultaneously, e.g. when evaluating the performances of several different neural network architectures in a parallel computing environment. In this paper, we develop a novel batch Bayesian optimization algorithm --- the parallel knowledge gradient method. By construction, this method provides the one-step Bayes-optimal batch of points to sample. We provide an efficient strategy for computing this Bayes-optimal batch of points, and we demonstrate that the parallel knowledge gradient method finds global optima significantly faster than previous batch Bayesian optimization algorithms on both synthetic test functions and when tuning hyperparameters of practical machine learning algorithms, especially when function evaluations are noisy.
LGMay 30, 2016
The Bayesian Linear Information Filtering ProblemBangrui Chen, Peter I. Frazier
We present a Bayesian sequential decision-making formulation of the information filtering problem, in which an algorithm presents items (news articles, scientific papers, tweets) arriving in a stream, and learns relevance from user feedback on presented items. We model user preferences using a Bayesian linear model, similar in spirit to a Bayesian linear bandit. We compute a computational upper bound on the value of the optimal policy, which allows computing an optimality gap for implementable policies. We then use this analysis as motivation in introducing a pair of new Decompose-Then-Decide (DTD) heuristic policies, DTD-Dynamic-Programming (DTD-DP) and DTD-Upper-Confidence-Bound (DTD-UCB). We compare DTD-DP and DTD-UCB against several benchmarks on real and simulated data, demonstrating significant improvement, and show that the achieved performance is close to the upper bound.
LGMay 28, 2016
Dueling Bandits with Dependent ArmsBangrui Chen, Peter I. Frazier
We study dueling bandits with weak utility-based regret when preferences over arms have a total order and carry observable feature vectors. The order is assumed to be determined by these feature vectors, an unknown preference vector, and a known utility function. This structure introduces dependence between preferences for pairs of arms, and allows learning about the preference over one pair of arms from the preference over another pair of arms. We propose an algorithm for this setting called Comparing The Best (CTB), which we show has constant expected cumulative weak utility-based regret. We provide a Bayesian interpretation for CTB, an implementation appropriate for a small number of arms, and an alternate implementation for many arms that can be used when the input parameters satisfy a decomposability condition. We demonstrate through numerical experiments that CTB with appropriate input parameters outperforms all benchmarks considered.
MLMar 1, 2016
Multi-Information Source OptimizationMatthias Poloczek, Jialei Wang, Peter I. Frazier
We consider Bayesian optimization of an expensive-to-evaluate black-box objective function, where we also have access to cheaper approximations of the objective. In general, such approximations arise in applications such as reinforcement learning, engineering, and the natural sciences, and are subject to an inherent, unknown bias. This model discrepancy is caused by an inadequate internal model that deviates from reality and can vary over the domain, making the utilization of these approximations a non-trivial task. We present a novel algorithm that provides a rigorous mathematical treatment of the uncertainties arising from model discrepancies and noisy observations. Its optimization decisions rely on a value of information analysis that extends the Knowledge Gradient factor to the setting of multiple information sources that vary in cost: each sampling decision maximizes the predicted benefit per unit cost. We conduct an experimental evaluation that demonstrates that the method consistently outperforms other state-of-the-art techniques: it finds designs of considerably higher objective value and additionally inflicts less cost in the exploration process.
LGFeb 7, 2016
Stratified Bayesian OptimizationSaul Toscano-Palmerin, Peter I. Frazier
We consider derivative-free black-box global optimization of expensive noisy functions, when most of the randomness in the objective is produced by a few influential scalar random inputs. We present a new Bayesian global optimization algorithm, called Stratified Bayesian Optimization (SBO), which uses this strong dependence to improve performance. Our algorithm is similar in spirit to stratification, a technique from simulation, which uses strong dependence on a categorical representation of the random input to reduce variance. We demonstrate in numerical experiments that SBO outperforms state-of-the-art Bayesian optimization benchmarks that do not leverage this dependence.
LGDec 31, 2015
Bayes-Optimal Effort Allocation in Crowdsourcing: Bounds and Index PoliciesWeici Hu, Peter I. Frazier
We consider effort allocation in crowdsourcing, where we wish to assign labeling tasks to imperfect homogeneous crowd workers to maximize overall accuracy in a continuous-time Bayesian setting, subject to budget and time constraints. The Bayes-optimal policy for this problem is the solution to a partially observable Markov decision process, but the curse of dimensionality renders the computation infeasible. Based on the Lagrangian Relaxation technique in Adelman & Mersereau (2008), we provide a computationally tractable instance-specific upper bound on the value of this Bayes-optimal policy, which can in turn be used to bound the optimality gap of any other sub-optimal policy. In an approach similar in spirit to the Whittle index for restless multiarmed bandits, we provide an index policy for effort allocation in crowdsourcing and demonstrate numerically that it outperforms other stateof- arts and performs close to optimal solution.
MLJun 3, 2015
Bayesian optimization for materials designPeter I. Frazier, Jialei Wang
We introduce Bayesian optimization, a technique developed for optimizing time-consuming engineering simulations and for fitting machine learning models on large datasets. Bayesian optimization guides the choice of experiments during materials design and discovery to find good material designs in as few experiments as possible. We focus on the case when materials designs are parameterized by a low-dimensional vector. Bayesian optimization is built on a statistical technique called Gaussian process regression, which allows predicting the performance of a new design based on previously tested designs. After providing a detailed introduction to Gaussian process regression, we introduce two Bayesian optimization methods: expected improvement, for design problems with noise-free evaluations; and the knowledge-gradient method, which generalizes expected improvement and may be used in design problems with noisy evaluations. Both methods are derived using a value-of-information analysis, and enjoy one-step Bayes-optimality.
MLMay 25, 2015
Clustering via Content-Augmented Stochastic BlockmodelsJ. Massey Cashore, Xiaoting Zhao, Alexander A. Alemi et al.
Much of the data being created on the web contains interactions between users and items. Stochastic blockmodels, and other methods for community detection and clustering of bipartite graphs, can infer latent user communities and latent item clusters from this interaction data. These methods, however, typically ignore the items' contents and the information they provide about item clusters, despite the tendency of items in the same latent cluster to share commonalities in content. We introduce content-augmented stochastic blockmodels (CASB), which use item content together with user-item interaction data to enhance the user communities and item clusters learned. Comparisons to several state-of-the-art benchmark methods, on datasets arising from scientists interacting with scientific articles, show that content-augmented stochastic blockmodels provide highly accurate clusters with respect to metrics representative of the underlying community structure.
LGOct 29, 2014
A Markov Decision Process Analysis of the Cold Start Problem in Bayesian Information FilteringXiaoting Zhao, Peter I. Frazier
We consider the information filtering problem, in which we face a stream of items, and must decide which ones to forward to a user to maximize the number of relevant items shown, minus a penalty for each irrelevant item shown. Forwarding decisions are made separately in a personalized way for each user. We focus on the cold-start setting for this problem, in which we have limited historical data on the user's preferences, and must rely on feedback from forwarded articles to learn which the fraction of items relevant to the user in each of several item categories. Performing well in this setting requires trading exploration vs. exploitation, forwarding items that are likely to be irrelevant, to allow learning that will improve later performance. In a Bayesian setting, and using Markov decision processes, we show how the Bayes-optimal forwarding algorithm can be computed efficiently when the user will examine each forwarded article, and how an upper bound on the Bayes-optimal procedure and a heuristic index policy can be obtained for the setting when the user will examine only a limited number of forwarded items. We present results from simulation experiments using parameters estimated using historical data from arXiv.org.
OCJul 30, 2014
Exploration vs. Exploitation in the Information Filtering ProblemXiaoting Zhao, Peter I. Frazier
We consider information filtering, in which we face a stream of items too voluminous to process by hand (e.g., scientific articles, blog posts, emails), and must rely on a computer system to automatically filter out irrelevant items. Such systems face the exploration vs. exploitation tradeoff, in which it may be beneficial to present an item despite a low probability of relevance, just to learn about future items with similar content. We present a Bayesian sequential decision-making model of this problem, show how it may be solved to optimality using a decomposition to a collection of two-armed bandit problems, and show structural results for the optimal policy. We show that the resulting method is especially useful when facing the cold start problem, i.e., when filtering items for new users without a long history of past interactions. We then present an application of this information filtering method to a historical dataset from the arXiv.org repository of scientific articles.
ITJul 16, 2014
Probabilistic Group Testing under Sum Observations: A Parallelizable 2-Approximation for Entropy LossWeidong Han, Purnima Rajan, Peter I. Frazier et al.
We consider the problem of group testing with sum observations and noiseless answers, in which we aim to locate multiple objects by querying the number of objects in each of a sequence of chosen sets. We study a probabilistic setting with entropy loss, in which we assume a joint Bayesian prior density on the locations of the objects and seek to choose the sets queried to minimize the expected entropy of the Bayesian posterior distribution after a fixed number of questions. We present a new non-adaptive policy, called the dyadic policy, show it is optimal among non-adaptive policies, and is within a factor of two of optimal among adaptive policies. This policy is quick to compute, its nonadaptive nature makes it easy to parallelize, and our bounds show it performs well even when compared with adaptive policies. We also study an adaptive greedy policy, which maximizes the one-step expected reduction in entropy, and show that it performs at least as well as the dyadic policy, offering greater query efficiency but reduced parallelism. Numerical experiments demonstrate that both procedures outperform a divide-and-conquer benchmark policy from the literature, called sequential bifurcation, and show how these procedures may be applied in a stylized computer vision problem.
OCJul 10, 2014
A New Optimal Stepsize For Approximate Dynamic ProgrammingIlya O. Ryzhov, Peter I. Frazier, Warren B. Powell
Approximate dynamic programming (ADP) has proven itself in a wide range of applications spanning large-scale transportation problems, health care, revenue management, and energy systems. The design of effective ADP algorithms has many dimensions, but one crucial factor is the stepsize rule used to update a value function approximation. Many operations research applications are computationally intensive, and it is important to obtain good results quickly. Furthermore, the most popular stepsize formulas use tunable parameters and can produce very poor results if tuned improperly. We derive a new stepsize rule that optimizes the prediction error in order to improve the short-term performance of an ADP algorithm. With only one, relatively insensitive tunable parameter, the new rule adapts to the level of noise in the problem and produces faster convergence in numerical experiments.