Stochastic Rising Bandits
This addresses a specific variant of bandit problems for sequential decision-making, offering incremental improvements with tailored algorithms for monotonic payoff structures.
The paper tackles the problem of stochastic multi-armed bandits with monotonically non-decreasing expected payoffs, designing algorithms (R-ed-UCB and R-less-UCB) that achieve a regret bound of $\widetilde{\mathcal{O}}(T^{rac{2}{3}})$ under certain circumstances and demonstrate effectiveness in synthetic and real-world experiments.
This paper is in the field of stochastic Multi-Armed Bandits (MABs), i.e., those sequential selection techniques able to learn online using only the feedback given by the chosen option (a.k.a. arm). We study a particular case of the rested and restless bandits in which the arms' expected payoff is monotonically non-decreasing. This characteristic allows designing specifically crafted algorithms that exploit the regularity of the payoffs to provide tight regret bounds. We design an algorithm for the rested case (R-ed-UCB) and one for the restless case (R-less-UCB), providing a regret bound depending on the properties of the instance and, under certain circumstances, of $\widetilde{\mathcal{O}}(T^{\frac{2}{3}})$. We empirically compare our algorithms with state-of-the-art methods for non-stationary MABs over several synthetically generated tasks and an online model selection problem for a real-world dataset. Finally, using synthetic and real-world data, we illustrate the effectiveness of the proposed approaches compared with state-of-the-art algorithms for the non-stationary bandits.