Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits
This provides tight theoretical guarantees for linear contextual bandits, a key problem in online learning and reinforcement learning, with incremental improvements in regret bounds.
The paper establishes a nearly minimax-optimal regret bound of Ω(√(dT(log T)(log n))) for linear contextual bandits with finite action sets and introduces a VCL SupLinUCB algorithm that matches this bound up to iterated logarithmic factors, saving two √(log T) factors from prior work.
We study the linear contextual bandit problem with finite action sets. When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $Ω(\sqrt{dT (\log T) (\log n)})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. Our algorithmic result saves two $\sqrt{\log T}$ factors from previous analysis, and our information-theoretical lower bound also improves previous results by one $\sqrt{\log T}$ factor, revealing a regret scaling quite different from classical multi-armed bandits in which no logarithmic $T$ term is present in minimax regret. Our proof techniques include variable confidence levels and a careful analysis of layer sizes of SupLinUCB on the upper bound side, and delicately constructed adversarial sequences showing the tightness of elliptical potential lemmas on the lower bound side.