AISep 7, 2016

Random Shuffling and Resets for the Non-stationary Stochastic Bandit Problem

Robin Allesiardo, Raphaël Féraud, Odalric-Ambrym Maillard

arXiv:1609.02139v19.14 citationsh-index: 17

Originality Highly original

AI Analysis

This addresses the challenge of non-stationary rewards in bandit problems for applications like online advertising or recommendation systems, offering a more robust algorithm.

The paper tackles the non-stationary stochastic bandit problem by introducing a random shuffling modification to Successive Elimination, which achieves the same sample complexity and regret guarantees as the original version but in a wider class of problems without stationary assumptions, with sample complexity of O(Δ^{-2}√(NKδ^{-1} log(K δ^{-1}))) and regret of O(Δ^{-1}√(NTK log(TK))).

We consider a non-stationary formulation of the stochastic multi-armed bandit where the rewards are no longer assumed to be identically distributed. For the best-arm identification task, we introduce a version of Successive Elimination based on random shuffling of the $K$ arms. We prove that under a novel and mild assumption on the mean gap $Δ$, this simple but powerful modification achieves the same guarantees in term of sample complexity and cumulative regret than its original version, but in a much wider class of problems, as it is not anymore constrained to stationary distributions. We also show that the original {\sc Successive Elimination} fails to have controlled regret in this more general scenario, thus showing the benefit of shuffling. We then remove our mild assumption and adapt the algorithm to the best-arm identification task with switching arms. We adapt the definition of the sample complexity for that case and prove that, against an optimal policy with $N-1$ switches of the optimal arm, this new algorithm achieves an expected sample complexity of $O(Δ^{-2}\sqrt{NKδ^{-1} \log(K δ^{-1})})$, where $δ$ is the probability of failure of the algorithm, and an expected cumulative regret of $O(Δ^{-1}{\sqrt{NTK \log (TK)}})$ after $T$ time steps.

View on arXiv PDF

Similar