Pessimism for Offline Linear Contextual Bandits using $\ell_p$ Confidence Sets
This work addresses the problem of offline policy optimization in bandits for researchers and practitioners, offering an adaptively optimal method that is incremental over prior pessimism approaches.
The authors tackled offline learning in linear contextual bandits by introducing a family of pessimistic learning rules based on $\ell_p$ confidence sets, where a novel $\hat{\pi}_\infty$ rule achieves minimax optimal performance (up to log factors) across all $\ell_q$-constrained problems, strictly dominating other predictors like $\hat{\pi}_2$.
We present a family $\{\hatπ\}_{p\ge 1}$ of pessimistic learning rules for offline learning of linear contextual bandits, relying on confidence sets with respect to different $\ell_p$ norms, where $\hatπ_2$ corresponds to Bellman-consistent pessimism (BCP), while $\hatπ_\infty$ is a novel generalization of lower confidence bound (LCB) to the linear setting. We show that the novel $\hatπ_\infty$ learning rule is, in a sense, adaptively optimal, as it achieves the minimax performance (up to log factors) against all $\ell_q$-constrained problems, and as such it strictly dominates all other predictors in the family, including $\hatπ_2$.