Offline Neural Contextual Bandits: Pessimism, Optimization and Generalization
This paper addresses offline policy learning, offering researchers and practitioners a provably efficient method with improved generalization and computational efficiency, though the contribution is incremental in that it builds on existing offline learning frameworks.
The paper tackles offline policy learning with neural networks in contextual bandits, proposing a method that generalizes over unseen contexts under mild distributional shift conditions and is computationally efficient, with empirical validation on synthetic and real-world problems.
Offline policy learning (OPL) leverages existing data collected a priori for policy optimization without any active exploration. Despite the prevalence of, and recent interest in, this problem, its theoretical and algorithmic foundations in function approximation settings remain under-developed. In this paper, we consider this problem on the axes of distributional shift, optimization, and generalization in offline contextual bandits with neural networks. In particular, we propose a provably efficient offline contextual bandit algorithm with neural network function approximation that does not require any functional assumption on the reward. We show that our method provably generalizes over unseen contexts under a milder condition for distributional shift than existing OPL works. Notably, unlike any other OPL method, our method learns from the offline data in an online manner using stochastic gradient descent, allowing us to bring the benefits of online learning to the offline setting. Moreover, we show that our method is more computationally efficient and has a better dependence on the effective dimension of the neural network than an online counterpart. Finally, we demonstrate the empirical effectiveness of our method in a range of synthetic and real-world OPL problems.
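The abstract names two ingredients, pessimism and streaming the offline data through stochastic gradient descent, without giving the algorithm itself. The following is only an illustrative sketch of that general recipe, not the paper's method. It assumes a one-hidden-layer ReLU network, a single SGD step per logged sample, and a lower confidence bound built from network gradients with a diagonal covariance approximation. All function names (`encode`, `train_offline`, `lower_confidence_bounds`, `pessimistic_action`) and all hyperparameters (`width`, `lr`, `lam`, `beta`) are hypothetical choices made for this example.

```python
import math
import random


def encode(context, action, n_actions):
    """Place the context in the block of the given action (zeros elsewhere)."""
    d = len(context)
    x = [0.0] * (d * n_actions)
    x[action * d:(action + 1) * d] = context
    return x


def forward_and_grad(W, v, x):
    """Return f(x) and the flattened gradient of f with respect to W.

    f(x) = (1 / sqrt(m)) * sum_j v_j * relu(w_j . x); the signs v_j stay fixed.
    """
    scale = 1.0 / math.sqrt(len(W))
    out, grad = 0.0, []
    for w_j, v_j in zip(W, v):
        pre = sum(wi * xi for wi, xi in zip(w_j, x))
        active = pre > 0.0
        if active:
            out += scale * v_j * pre
        grad.extend(scale * v_j * xi if active else 0.0 for xi in x)
    return out, grad


def train_offline(data, n_actions, width=16, lr=0.05, lam=1.0, seed=0):
    """Stream logged (context, action, reward) triples once, one SGD step each."""
    rng = random.Random(seed)
    dim = len(data[0][0]) * n_actions
    W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    v = [rng.choice((-1.0, 1.0)) for _ in range(width)]
    # Assumption: keep only the diagonal of the gradient covariance matrix.
    cov_diag = [lam] * (width * dim)
    for context, action, reward in data:
        x = encode(context, action, n_actions)
        pred, grad = forward_and_grad(W, v, x)
        for i, g in enumerate(grad):
            cov_diag[i] += g * g
        err = pred - reward  # gradient of 0.5 * (pred - reward)^2 w.r.t. pred
        for j in range(width):
            for k in range(dim):
                W[j][k] -= lr * err * grad[j * dim + k]
    return W, v, cov_diag


def lower_confidence_bounds(W, v, cov_diag, context, n_actions, beta=1.0):
    """Return (lower bound, prediction) for every action in this context."""
    bounds = []
    for a in range(n_actions):
        pred, grad = forward_and_grad(W, v, encode(context, a, n_actions))
        bonus = math.sqrt(sum(g * g / c for g, c in zip(grad, cov_diag)))
        bounds.append((pred - beta * bonus, pred))
    return bounds


def pessimistic_action(W, v, cov_diag, context, n_actions, beta=1.0):
    """Pick the action with the highest lower confidence bound."""
    bounds = lower_confidence_bounds(W, v, cov_diag, context, n_actions, beta)
    return max(range(n_actions), key=lambda a: bounds[a][0])
```

In this sketch the penalty grows for context-action pairs whose gradient directions are poorly covered by the logged data, so the policy avoids actions it has little evidence about. Setting `beta=0` recovers the greedy policy with respect to the learned reward model.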