Optimal Best-Arm Identification in Bandits with Access to Offline Data
This work addresses the practical challenge of efficiently using offline data to enhance online decision-making in bandit problems, representing an incremental advance in hybrid learning methods.
The paper tackles the problem of identifying the best arm in stochastic bandits by combining offline data with online learning, achieving algorithms that match a lower bound on sample complexity for small error probabilities and are computationally efficient with per-sample cost of Õ(K).
Learning paradigms based purely on offline data as well as those based solely on sequential online learning have been well-studied in the literature. In this paper, we consider combining offline data with online learning, an area less studied but of obvious practical importance. We consider the stochastic $K$-armed bandit problem, where our goal is to identify the arm with the highest mean in the presence of relevant offline data, with confidence $1-δ$. We conduct a lower bound analysis on policies that provide such $1-δ$ probabilistic correctness guarantees. We develop algorithms that match the lower bound on sample complexity when $δ$ is small. Our algorithms are computationally efficient with an average per-sample acquisition cost of $\tilde{O}(K)$, and rely on a careful characterization of the optimality conditions of the lower bound problem.