ML LG STJul 11, 2024

Adaptive Smooth Non-Stationary Bandits

arXiv:2407.08654v24 citationsh-index: 2

AI Analysis

This work resolves open questions in bandit theory by providing adaptive algorithms for smooth non-stationary environments, which is incremental but addresses specific gaps in prior literature.

The paper tackles the problem of non-stationary bandits with smooth reward changes, establishing the minimax dynamic regret rate for all parameters and achieving it adaptively without prior knowledge, while also deriving faster gap-dependent regret rates in environments with a safe arm, showing these rates are tight and reveal optimistic regimes.

We study a $K$-armed non-stationary bandit model where rewards change smoothly, as captured by Hölder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a Hölder exponent $β$ and coefficient $λ$. While various sub-cases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all $K,β,λ$. Next, we show this optimal dynamic regret can be attained adaptively, without knowledge of $β,λ$. To contrast, even with parameter knowledge, upper bounds were only previously known for limited regimes $β\leq 1$ and $β=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al.,2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates are long known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $\sqrt{T}$. While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our new gap-dependent rates reveal new optimistic regimes of non-stationarity where even the logarithmic bounds are pessimistic. We show our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth Hölder class model.

View on arXiv PDF

Similar