LG ML

May 4, 2022

Non-Stationary Bandit Learning via Predictive Sampling

Stanford

arXiv:2205.01970v816.126 citations

h-index: 55

Originality: Highly original

AI Analysis

This work addresses a limitation of bandit algorithms in non-stationary environments. The contribution is incremental in that it builds on Thompson sampling.

The paper tackles the poor performance of Thompson sampling in non-stationary bandit environments by proposing predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. Numerical simulations show that it outperforms Thompson sampling in all non-stationary environments examined.

Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We attribute such failures to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
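
To make the setting concrete, the following is a minimal sketch of the baseline the abstract discusses: standard Beta-Bernoulli Thompson sampling run on a bandit whose arm means drift. It is not the paper's predictive sampling algorithm. The environment model (each arm's mean is redrawn from Uniform(0, 1) with a per-arm probability), the function name `thompson_sampling_nonstationary`, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import random


def thompson_sampling_nonstationary(redraw_probs=(0.001, 0.5), horizon=1000, seed=0):
    """Run Beta-Bernoulli Thompson sampling on a drifting Bernoulli bandit.

    redraw_probs[a] is the per-step probability that arm a's mean is
    redrawn from Uniform(0, 1). This drift model is an assumption made
    for illustration only.
    Returns (pulls per arm, total reward).
    """
    rng = random.Random(seed)
    num_arms = len(redraw_probs)
    means = [rng.random() for _ in range(num_arms)]

    # Beta(1, 1) prior on each arm's mean.
    alpha = [1.0] * num_arms
    beta = [1.0] * num_arms
    pulls = [0] * num_arms
    total_reward = 0

    for _ in range(horizon):
        # Environment drift: each arm's mean may be redrawn this step.
        for a in range(num_arms):
            if rng.random() < redraw_probs[a]:
                means[a] = rng.random()

        # Thompson sampling: draw one posterior sample per arm, act greedily on it.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(num_arms)]
        action = max(range(num_arms), key=lambda a: samples[a])

        reward = 1 if rng.random() < means[action] else 0

        # Stationary posterior update: it ignores how quickly each arm changes.
        alpha[action] += reward
        beta[action] += 1 - reward
        pulls[action] += 1
        total_reward += reward

    return pulls, total_reward


if __name__ == "__main__":
    pulls, total = thompson_sampling_nonstationary()
    print("pulls per arm:", pulls, "total reward:", total)
```

Note that the posterior update treats every observation as equally durable, regardless of each arm's `redraw_probs` value. That is the behavior the abstract identifies as the source of failure: exploration does not distinguish actions by how quickly the information they yield loses its usefulness.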
