Optimal Learning for Sequential Decision Making for Expensive Cost Functions with Stochastic Binary Feedbacks
This work addresses decision-making in resource-constrained real-world applications, but it is incremental as it extends existing knowledge gradient methods to a specific setting.
The paper tackles the problem of sequential decision-making with expensive cost functions and stochastic binary feedbacks by developing a knowledge gradient policy using an online Bayesian linear classifier to maximize success probability. It shows that the policy is asymptotically optimal in offline settings and extends it to contextual bandits, with experiments demonstrating efficiency.
We consider the problem of sequentially making decisions that are rewarded by "successes" and "failures" which can be predicted through an unknown relationship that depends on a partially controllable vector of attributes for each instance. The learner takes an active role in selecting samples from the instance pool. The goal is to maximize the probability of success in either offline (training) or online (testing) phases. Our problem is motivated by real-world applications where observations are time-consuming and/or expensive. We develop a knowledge gradient policy using an online Bayesian linear classifier to guide the experiment by maximizing the expected value of information of labeling each alternative. We provide a finite-time analysis of the estimated error and show that the maximum likelihood estimator based produced by the KG policy is consistent and asymptotically normal. We also show that the knowledge gradient policy is asymptotically optimal in an offline setting. This work further extends the knowledge gradient to the setting of contextual bandits. We report the results of a series of experiments that demonstrate its efficiency.