Instance-optimal PAC Algorithms for Contextual Bandits
This work addresses the understudied problem of instance-optimal best-arm identification in contextual bandits. It builds incrementally on existing regret-minimization research while introducing new theoretical characterizations and algorithms.
The paper tackles the problem of identifying near-optimal policies in stochastic contextual bandits with PAC guarantees. It establishes the first instance-dependent sample complexity bounds, provides matching upper and lower bounds for the agnostic and linear settings, and shows that no algorithm can be both minimax-optimal for regret and instance-optimal for best-arm identification.
In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-optimal best-arm identification counterparts remain seldom studied. In this work, we focus on the stochastic contextual bandit problem in the $(ε,δ)$-$\textit{PAC}$ setting: given a policy class $Π$, the goal of the learner is to return a policy $π\in Π$ whose expected reward is within $ε$ of that of the optimal policy with probability greater than $1-δ$. We characterize the first $\textit{instance-dependent}$ PAC sample complexity of contextual bandits through a quantity $ρ_Π$, and provide matching upper and lower bounds in terms of $ρ_Π$ for the agnostic and linear contextual best-arm identification settings. We show that no algorithm can be simultaneously minimax-optimal for regret minimization and instance-optimal for PAC best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that relies on a polynomial number of calls to an argmax oracle.
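To make the $(ε,δ)$-PAC criterion and the notion of an argmax oracle concrete, the following is a minimal Python sketch on a toy problem. It is not the algorithm from the paper: the finite policy class, the brute-force oracle, and all names (`policy_value`, `argmax_oracle`, `is_epsilon_optimal`, `POLICIES`, the toy reward table) are assumptions introduced purely for illustration. In practice an argmax oracle would typically be implemented by a cost-sensitive classification solver rather than by enumerating policies.

```python
from typing import Callable, List, Sequence

# A policy maps a context index to an arm index (illustrative assumption).
Policy = Callable[[int], int]


def policy_value(policy: Policy,
                 context_probs: Sequence[float],
                 mean_rewards: Sequence[Sequence[float]]) -> float:
    """Expected reward of `policy` when the context distribution and the
    mean reward of every (context, arm) pair are known (toy setting only)."""
    return sum(p * mean_rewards[c][policy(c)]
               for c, p in enumerate(context_probs))


def argmax_oracle(policies: Sequence[Policy],
                  contexts: Sequence[int],
                  reward_estimates: Sequence[Sequence[float]]) -> Policy:
    """Return the policy with the largest summed estimated reward over the
    observed contexts. Brute-force enumeration stands in for a real oracle."""
    return max(policies,
               key=lambda pi: sum(r[pi(c)]
                                  for c, r in zip(contexts, reward_estimates)))


def is_epsilon_optimal(policy: Policy,
                       policies: Sequence[Policy],
                       context_probs: Sequence[float],
                       mean_rewards: Sequence[Sequence[float]],
                       epsilon: float) -> bool:
    """Check whether `policy` is within `epsilon` of the best policy in the class."""
    best = max(policy_value(pi, context_probs, mean_rewards) for pi in policies)
    return policy_value(policy, context_probs, mean_rewards) >= best - epsilon


# Toy instance: two equally likely contexts, two arms.
CONTEXT_PROBS = [0.5, 0.5]
MEAN_REWARDS = [[0.9, 0.1],   # context 0: arm 0 is better
                [0.2, 0.8]]   # context 1: arm 1 is better


def always_0(c: int) -> int:
    return 0


def always_1(c: int) -> int:
    return 1


def match(c: int) -> int:
    return c  # plays arm 0 in context 0 and arm 1 in context 1


POLICIES: List[Policy] = [always_0, always_1, match]

# One oracle call on three observed contexts with per-arm reward estimates.
chosen = argmax_oracle(POLICIES, [0, 1, 0],
                       [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

An $(ε,δ)$-PAC learner must return a policy for which `is_epsilon_optimal` holds with probability greater than $1-δ$ over its own sampling. The sketch checks the condition against known means only because this is a toy instance; a real learner never has access to `MEAN_REWARDS`.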