Incentivized Exploration of Non-Stationary Stochastic Bandits
This addresses the problem of incentivizing exploration in dynamic bandit settings for applications like online platforms, though it appears incremental as it extends prior work on stationary bandits to non-stationary cases.
The paper tackles incentivized exploration in non-stationary multi-armed bandits with biased feedback, proposing algorithms for abruptly- and continuously-changing environments that achieve sublinear regret and compensation over time.
We study incentivized exploration for the multi-armed bandit (MAB) problem with non-stationary reward distributions, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on the reward. We consider two different non-stationary environments: abruptly-changing and continuously-changing, and propose respective incentivized exploration algorithms. We show that the proposed algorithms achieve sublinear regret and compensation over time, thus effectively incentivizing exploration despite the nonstationarity and the biased or drifted feedback.