92.3 ML Apr 23
A single algorithm for both restless and rested rotting bandits
Julien Seznec, Pierre Ménard, Alessandro Lazaric et al.
In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated with the actions tend to decrease over time. This decay is caused either by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their values decrease over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandits perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandits, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.
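The difference between the rested and restless rotting settings described above can be illustrated with a small noise-free sketch. Everything here is an assumption made for illustration and is not taken from the paper: the linear decay, the function name `total_reward`, and its parameters. In the rested case an arm's value depends on how often that arm was pulled; in the restless case it depends on the round index.

```python
def total_reward(choices, initial, decay, rested):
    """Sum the noise-free rewards collected by a fixed sequence of arm choices.

    Hypothetical model: each arm's value decays linearly.
    - rested:   value = initial[arm] - decay[arm] * (number of past pulls of arm)
    - restless: value = initial[arm] - decay[arm] * (current round index)
    """
    pulls = [0] * len(initial)
    total = 0.0
    for t, arm in enumerate(choices):
        # The "clock" driving the decay is what separates the two settings.
        clock = pulls[arm] if rested else t
        total += initial[arm] - decay[arm] * clock
        pulls[arm] += 1
    return total


initial = [1.0, 1.0]
decay = [0.1, 0.1]
same_arm = [0, 0, 0, 0]
alternate = [0, 1, 0, 1]

rested_same = total_reward(same_arm, initial, decay, rested=True)
rested_alt = total_reward(alternate, initial, decay, rested=True)
restless_same = total_reward(same_arm, initial, decay, rested=False)
restless_alt = total_reward(alternate, initial, decay, rested=False)
```

With these illustrative numbers, alternating arms collects more reward than repeating one arm in the rested case, while in the restless case both sequences collect the same amount, because the decay happens regardless of which arm is pulled.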
ML Feb 5, 2019
Efficient Change-Point Detection for Tackling Piecewise-Stationary Bandits
Lilian Besson, Emilie Kaufmann, Odalric-Ambrym Maillard et al.
We introduce GLR-klUCB, a novel algorithm for the piecewise i.i.d. non-stationary bandit problem with bounded rewards. This algorithm combines an efficient bandit algorithm, kl-UCB, with an efficient, parameter-free change-point detector, the Bernoulli Generalized Likelihood Ratio Test, for which we provide new theoretical guarantees of independent interest. Unlike previous non-stationary bandit algorithms using a change-point detector, GLR-klUCB does not need to be calibrated based on prior knowledge of the arms' means. We prove that this algorithm can attain a $O(\sqrt{TA \Upsilon_T\log(T)})$ regret in $T$ rounds on some "easy" instances, where $A$ is the number of arms and $\Upsilon_T$ the number of change-points, without prior knowledge of $\Upsilon_T$. In contrast with recently proposed algorithms that are agnostic to $\Upsilon_T$, we perform a numerical study showing that GLR-klUCB is also very efficient in practice, beyond easy instances.
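A Bernoulli generalized likelihood ratio test for a single change in mean can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, rewards are assumed to lie in $[0, 1]$, and the detection threshold is left as a caller-supplied parameter because the abstract does not specify it.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    # Clip away from 0 and 1 so the logarithms stay finite.
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def glr_statistic(xs):
    """Maximum over split points s of the Bernoulli GLR statistic.

    For each split, compares the means before and after s with the
    overall mean, weighted by the number of samples on each side.
    """
    n = len(xs)
    if n < 2:
        return 0.0
    total = sum(xs)
    mean_all = total / n
    best = 0.0
    prefix = 0.0
    for s in range(1, n):
        prefix += xs[s - 1]
        mean_left = prefix / s
        mean_right = (total - prefix) / (n - s)
        stat = (s * kl_bernoulli(mean_left, mean_all)
                + (n - s) * kl_bernoulli(mean_right, mean_all))
        best = max(best, stat)
    return best


def change_detected(xs, threshold):
    """Flag a change when the statistic exceeds a caller-chosen threshold (assumption)."""
    return glr_statistic(xs) > threshold


shifted = [0.0] * 50 + [1.0] * 50   # mean jumps from 0 to 1 halfway through
constant = [1.0] * 100              # no change at all
```

In a bandit loop, one such detector would typically be run per arm on that arm's observed rewards, with the arm's history reset once a change is flagged; that wiring is not shown here.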
ML Nov 27, 2018
Rotting bandits are not harder than stochastic ones
Julien Seznec, Andrea Locatelli, Alexandra Carpentier et al.
In stochastic multi-armed bandits, the reward distribution of each arm is assumed to be stationary. This assumption is often violated in practice (e.g., in recommendation systems), where the reward of an arm may change whenever it is selected, i.e., the rested bandit setting. In this paper, we consider the non-parametric rotting bandit setting, where rewards can only decrease. We introduce the filtering on expanding window average (FEWA) algorithm that constructs moving averages of increasing windows to identify arms that are more likely to return high rewards when pulled once more. We prove that for an unknown horizon $T$, and without any knowledge of the decreasing behavior of the $K$ arms, FEWA achieves a problem-dependent regret bound of $\widetilde{\mathcal{O}}(\log{(KT)})$, and a problem-independent one of $\widetilde{\mathcal{O}}(\sqrt{KT})$. Our result substantially improves over the algorithm of Levine et al. (2017), which suffers regret $\widetilde{\mathcal{O}}(K^{1/3}T^{2/3})$. FEWA also matches known bounds for the stochastic bandit setting, thus showing that the rotting bandits are not harder. Finally, we report simulations confirming the theoretical improvements of FEWA.
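The idea of filtering arms on averages over expanding windows can be sketched as below. This is a simplified reading of the abstract, not the authors' exact procedure: the function names, the sub-Gaussian confidence width, the factor of 2 in the filter, and the tie-breaking rule are all assumptions made for illustration.

```python
import math


def window_average(rewards, h):
    """Average of the last h rewards (caller guarantees len(rewards) >= h)."""
    return sum(rewards[-h:]) / h


def confidence_width(h, delta, sigma=1.0):
    """Assumed sub-Gaussian deviation width for an average of h samples."""
    return math.sqrt(2.0 * sigma ** 2 * math.log(1.0 / delta) / h)


def filter_arms(history, active, h, delta, sigma=1.0):
    """Keep the arms whose window-h average is close to the best one."""
    averages = {i: window_average(history[i], h) for i in active}
    best = max(averages.values())
    width = confidence_width(h, delta, sigma)
    return [i for i in active if averages[i] >= best - 2.0 * width]


def select_arm(history, delta, sigma=1.0):
    """Grow the window until a surviving arm lacks enough samples, then pull it.

    history maps each arm to the list of rewards observed from it so far.
    """
    active = list(history)
    h = 1
    while True:
        # An arm with fewer than h samples cannot be evaluated on window h.
        short = [i for i in active if len(history[i]) < h]
        if short:
            return short[0]
        active = filter_arms(history, active, h, delta, sigma)
        h += 1


# Arm 0 has looked much better than arm 1 over the last five pulls.
history = {0: [1.0] * 5, 1: [0.0] * 5}
survivors = filter_arms(history, [0, 1], 1, 0.5, 0.1)
chosen = select_arm(history, delta=0.5, sigma=0.1)
```

Small windows react quickly to recent decay but are noisy, while large windows are more precise but may include outdated rewards; filtering across all window sizes is one way to balance the two. The loop always terminates because the arm with the best average survives every filter and eventually runs out of samples for the next window.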