LG SY OC MLNov 30, 2023

The Sliding Regret in Stochastic Bandits: Discriminating Index and Randomized Policies

arXiv:2311.18437v13.81 citationsh-index: 1

Originality Incremental advance

AI Analysis

This addresses the problem of algorithm stability in single-run performance for researchers and practitioners in bandit optimization, though it is incremental as it builds on existing regret analysis.

The paper introduces sliding regret to measure the smoothness of pseudo-regret in stochastic bandits, showing that randomized methods like Thompson Sampling achieve optimal sliding regret, while index policies such as UCB have the worst possible sliding regret under certain conditions.

This paper studies the one-shot behavior of no-regret algorithms for stochastic bandits. Although many algorithms are known to be asymptotically optimal with respect to the expected regret, over a single run, their pseudo-regret seems to follow one of two tendencies: it is either smooth or bumpy. To measure this tendency, we introduce a new notion: the sliding regret, that measures the worst pseudo-regret over a time-window of fixed length sliding to infinity. We show that randomized methods (e.g. Thompson Sampling and MED) have optimal sliding regret, while index policies, although possibly asymptotically optimal for the expected regret, have the worst possible sliding regret under regularity conditions on their index (e.g. UCB, UCB-V, KL-UCB, MOSS, IMED etc.). We further analyze the average bumpiness of the pseudo-regret of index policies via the regret of exploration, that we show to be suboptimal as well.

View on arXiv PDF

Similar