LG MLFeb 19, 2023

Estimating Optimal Policy Value in General Linear Contextual Bandits

Jonathan N. Lee, Weihao Kong, Aldo Pacchiano, Vidya Muthukumar, Emma Brunskill

arXiv:2302.09451v12.0h-index: 54

Originality Highly original

AI Analysis

This addresses the challenge of evaluating policies before they are learnable, which is important for tasks such as bandit model selection and treatment effect testing, though it is incremental with specific assumptions.

The paper tackles the problem of estimating the optimal policy value in general linear contextual bandits, showing that while it is generally hard, under stronger assumptions, sublinear estimation with O(√d) samples is possible, and they provide a practical algorithm with improved guarantees for applications like model selection.

In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was recently shown that fast $V^*$ estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We then present a more practical, computationally efficient algorithm that estimates a problem-dependent upper bound on $V^*$ that holds for general distributions and is tight when the context distribution is Gaussian. We prove our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound and the estimator to obtain novel and improved guarantees for several applications in bandit model selection and testing for treatment effects.

View on arXiv PDF

Similar