MLLGSTJul 11, 2024

Adaptive Smooth Non-Stationary Bandits

arXiv:2407.08654v24 citationsh-index: 2
AI Analysis

This work resolves open questions in bandit theory by providing adaptive algorithms for smooth non-stationary environments, which is incremental but addresses specific gaps in prior literature.

The paper tackles the problem of non-stationary bandits with smooth reward changes, establishing the minimax dynamic regret rate for all parameters and achieving it adaptively without prior knowledge, while also deriving faster gap-dependent regret rates in environments with a safe arm, showing these rates are tight and reveal optimistic regimes.

We study a $K$-armed non-stationary bandit model where rewards change smoothly, as captured by Hölder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a Hölder exponent $β$ and coefficient $λ$. While various sub-cases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all $K,β,λ$. Next, we show this optimal dynamic regret can be attained adaptively, without knowledge of $β,λ$. To contrast, even with parameter knowledge, upper bounds were only previously known for limited regimes $β\leq 1$ and $β=2$ (Slivkins, 2014; Krishnamurthy and Gopalan, 2021; Manegueu et al., 2021; Jia et al.,2023). Thus, our work resolves open questions raised by these disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in non-stationary bandits. While such rates are long known to be impossible in general (Garivier and Moulines, 2011), we show that environments admitting a safe arm (Suk and Kpotufe, 2022) allow for much faster rates than the worst-case scaling with $\sqrt{T}$. While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our new gap-dependent rates reveal new optimistic regimes of non-stationarity where even the logarithmic bounds are pessimistic. We show our new gap-dependent rate is tight and that its achievability (i.e., as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth Hölder class model.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes