Batched Nonparametric Contextual Bandits
This work addresses the challenge of efficient policy updates in contextual bandits for scenarios with batch processing constraints, offering a solution that reduces the number of updates needed for optimal performance.
The paper tackles the problem of nonparametric contextual bandits with batch constraints, where rewards are smooth functions of covariates and policies update at batch ends, by establishing a minimax regret lower bound and proposing a novel algorithm that achieves optimal regret up to logarithmic factors.
We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates can attain optimal regret in the fully online setting.