OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits
This addresses the challenge of data-dependent policy class selection in contextual bandits, offering a unified solution that avoids sub-optimal performance in either regime, though it is incremental in advancing bandit algorithms.
The paper tackles the problem of designing a single algorithm that performs optimally in both multi-armed and linear contextual bandit regimes without prior knowledge of the reward model, achieving problem-dependent optimal regret rates for multi-armed bandits and minimax optimal rates for linear contextual bandits under stochastic contexts.
We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for the alternate regime. We design a single computationally efficient algorithm that simultaneously obtains problem-dependent optimal regret rates in the simple multi-armed bandit regime and minimax optimal regret rates in the linear contextual bandit regime, without knowing a priori which of the two models generates the rewards. These results are proved under the condition of stochasticity of contextual information over multiple rounds. Our results should be viewed as a step towards principled data-dependent policy class selection for contextual bandits.