Linear Bandits on Ellipsoids: Minimax Optimal Algorithms
This work addresses linear stochastic bandits with ellipsoidal action sets, a basic setting in bandit optimization, and offers an algorithm whose regret guarantee matches a lower bound, rather than an incremental refinement of existing methods.
The paper tackles linear stochastic bandits with an ellipsoidal action set by providing the first known minimax optimal algorithm: its regret matches a newly derived lower bound up to a universal constant factor, and a run takes time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$.
We consider linear stochastic bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. We first derive a novel information-theoretic lower bound showing that the regret of any algorithm must be at least $\Omega(\min(d \sigma \sqrt{T} + d \|\theta\|_{A}, \|\theta\|_{A} T))$, where $d$ is the dimension, $T$ the time horizon, $\sigma^2$ the noise variance, $A$ a matrix defining the set of actions and $\theta$ the vector of unknown parameters. We then provide an algorithm whose regret matches this bound up to a multiplicative universal constant. The algorithm is non-classical in the sense that it is neither optimistic nor a sampling algorithm. The main idea is to combine a novel sequential procedure to estimate $\|\theta\|$ with an explore-and-commit strategy informed by this estimate. The algorithm is highly computationally efficient: a run requires only time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$, in contrast with known optimistic algorithms, which are not implementable in polynomial time. We go beyond minimax optimality and show that our algorithm is locally asymptotically minimax optimal, a much stronger notion of optimality. We further provide numerical experiments to illustrate our theoretical findings.
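To make the two regimes of the lower bound concrete, the following minimal sketch evaluates the expression inside the $\Omega(\cdot)$ for given problem parameters. It is an illustration only: the function name `lower_bound_rate` and the numerical values are hypothetical, the universal constant is ignored, and $\|\theta\|_{A}$ is passed in as a precomputed scalar because its exact definition is not given in the abstract.

```python
import math


def lower_bound_rate(d, T, sigma, theta_norm_A):
    """Evaluate min(d*sigma*sqrt(T) + d*||theta||_A, ||theta||_A * T).

    This is the rate inside the Omega(.) of the abstract, without the
    universal constant. theta_norm_A is assumed to be supplied by the
    caller as a nonnegative scalar.
    """
    first_term = d * sigma * math.sqrt(T) + d * theta_norm_A
    second_term = theta_norm_A * T
    return min(first_term, second_term)


# Hypothetical parameter values, chosen only to show both regimes.
for norm in (0.01, 1.0):
    rate = lower_bound_rate(d=10, T=10_000, sigma=1.0, theta_norm_A=norm)
    print(f"||theta||_A = {norm}: rate = {rate}")
```

With these illustrative values, the term $\|\theta\|_{A} T$ is the smaller one when $\|\theta\|_{A} = 0.01$, while $d \sigma \sqrt{T} + d \|\theta\|_{A}$ is the smaller one when $\|\theta\|_{A} = 1$.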