How to gamble with non-stationary $\mathcal{X}$-armed bandits and have no regrets
This addresses the challenge of adapting to abrupt changes in bandit environments, which is incremental but important for applications like dynamic recommendation systems.
The paper tackles the problem of non-stationary X-armed bandits, where the environment changes abruptly, by proposing a novel strategy that achieves sub-linear cumulative regret and is nearly optimal for highly smooth reward functions.
In $\mathcal{X}$-armed bandit problem an agent sequentially interacts with environment which yields a reward based on the vector input the agent provides. The agent's goal is to maximise the sum of these rewards across some number of time steps. The problem and its variations have been a subject of numerous studies, suggesting sub-linear and some times optimal strategies. The given paper introduces a novel variation of the problem. We consider an environment, which can abruptly change its behaviour an unknown number of times. To that end we propose a novel strategy and prove it attains sub-linear cumulative regret. Moreover, in case of highly smooth relation between an action and the corresponding reward, the method is nearly optimal. The theoretical result are supported by experimental study.