Tracking Most Significant Arm Switches in Bandits
This work addresses the challenge of improving regret rates in bandit problems with distribution shifts, offering a more efficient adaptive procedure for scenarios with few severe changes, which is incremental but provides specific gains over existing methods.
The paper tackles the problem of adapting to distribution shifts in bandits by proposing a new notion of significant shifts, focusing only on severe changes that necessitate restarts, and achieves faster regret rates than prior optimal bounds, such as O(sqrt(ST)) where S counts best arm switches, and O(V^(1/3)T^(2/3)) in terms of total variation V.
In bandit with distribution shifts, one aims to automatically adapt to unknown changes in reward distribution, and restart exploration when necessary. While this problem has been studied for many years, a recent breakthrough of Auer et al. (2018, 2019) provides the first adaptive procedure to guarantee an optimal (dynamic) regret $\sqrt{LT}$, for $T$ rounds, and an unknown number $L$ of changes. However, while this rate is tight in the worst case, it remained open whether faster rates are possible, without prior knowledge, if few changes in distribution are actually severe. To resolve this question, we propose a new notion of significant shift, which only counts very severe changes that clearly necessitate a restart: roughly, these are changes involving not only best arm switches, but also involving large aggregate differences in reward overtime. Thus, our resulting procedure adaptively achieves rates always faster (sometimes significantly) than $O(\sqrt{ST})$, where $S\ll L$ only counts best arm switches, while at the same time, always faster than the optimal $O(V^{\frac{1}{3}}T^{\frac{2}{3}})$ when expressed in terms of total variation $V$ (which aggregates differences overtime). Our results are expressed in enough generality to also capture non-stochastic adversarial settings.