STLGMLFeb 27, 2024

Batched Nonparametric Contextual Bandits

arXiv:2402.17732v44 citationsh-index: 1IEEE Trans Inf Theory
Originality Highly original
AI Analysis

This work addresses the challenge of efficient policy updates in contextual bandits for scenarios with batch processing constraints, offering a solution that reduces the number of updates needed for optimal performance.

The paper tackles the problem of nonparametric contextual bandits with batch constraints, where rewards are smooth functions of covariates and policies update at batch ends, by establishing a minimax regret lower bound and proposing a novel algorithm that achieves optimal regret up to logarithmic factors.

We study nonparametric contextual bandits under batch constraints, where the expected reward for each action is modeled as a smooth function of covariates, and the policy updates are made at the end of each batch of observations. We establish a minimax regret lower bound for this setting and propose a novel batch learning algorithm that achieves the optimal regret (up to logarithmic factors). In essence, our procedure dynamically splits the covariate space into smaller bins, carefully aligning their widths with the batch size. Our theoretical results suggest that for nonparametric contextual bandits, a nearly constant number of policy updates can attain optimal regret in the fully online setting.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes