MLLGSep 8, 2024

Sliding-Window Thompson Sampling for Non-Stationary Settings

arXiv:2409.05181v380 citationsh-index: 38
Originality Incremental advance
AI Analysis

This work addresses sequential decision-making in dynamic environments for researchers in reinforcement learning and bandit algorithms, but it is incremental as it corrects and generalizes existing analyses.

The paper tackles the problem of non-stationary multi-armed bandits by analyzing sliding-window Thompson sampling algorithms, deriving a unifying regret upper bound that matches state-of-the-art results in common settings.

Non-stationary multi-armed bandits (NS-MABs) model sequential decision-making problems in which the expected rewards of a set of actions, a.k.a.~arms, evolve over time. In this paper, we fill a gap in the literature by providing a novel analysis of Thompson sampling-inspired (TS) algorithms for NS-MABs that both corrects and generalizes existing work. Specifically, we study the cumulative frequentist regret of two algorithms based on sliding-window TS approaches with different priors, namely $\textit{Beta-SWTS}$ and $\textit{$γ$-SWGTS}$. We derive a unifying regret upper bound for these algorithms that applies to any arbitrary NS-MAB (with either Bernoulli or subgaussian rewards). Our result introduces new indices that capture the inherent sources of complexity in the learning problem. Then, we specialize our general result to two of the most common NS-MAB settings: the $\textit{abruptly changing}$ and the $\textit{smoothly changing}$ environments, showing that it matches state-of-the-art results. Finally, we evaluate the performance of the analyzed algorithms in simulated environments and compare them with state-of-the-art approaches for NS-MABs.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes