ML LG CO
Aug 8, 2018

Nonparametric Gaussian Mixture Models for the Multi-Armed Bandit

arXiv:1808.02932v4, 3.53 citations, Has Code

Originality: Incremental advance

AI Analysis

This work addresses reward model uncertainty in bandit problems, offering a flexible method that avoids case-by-case model design, though it is incremental as it builds on existing Thompson sampling and nonparametric techniques.

The authors tackled the problem of reward model uncertainty in multi-armed bandits by extending Thompson sampling with Bayesian nonparametric Gaussian mixture models, achieving improved regret performance over state-of-the-art alternatives in diverse environments.

We adopt Bayesian nonparametric mixture models to extend multi-armed bandits in general, and Thompson sampling in particular, to scenarios with reward model uncertainty. In the stochastic multi-armed bandit, the reward for the played arm is generated from an unknown distribution. Reward uncertainty, i.e., the lack of knowledge about the reward-generating distribution, induces the exploration-exploitation trade-off: a bandit agent must simultaneously learn the properties of the reward distribution and sequentially decide which action to take next. To handle this uncertainty, we use Bayesian nonparametric Gaussian mixture models for flexible reward density estimation. The proposed Bayesian nonparametric mixture model Thompson sampling sequentially learns the reward model that best approximates the true, yet unknown, per-arm reward distribution. We derive an asymptotic regret bound for the proposed method, based on a novel posterior-convergence analysis. In addition, we empirically evaluate its performance in diverse and previously elusive bandit environments, e.g., with rewards not in the exponential family, subject to outliers, and with different per-arm reward distributions. We show that the proposed Bayesian nonparametric Thompson sampling outperforms state-of-the-art alternatives in both averaged cumulative regret and regret volatility. The proposed method is valuable in the presence of bandit reward model uncertainty, as it avoids stringent case-by-case model design choices while still providing important regret savings.
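For readers unfamiliar with Thompson sampling, the loop the abstract builds on can be sketched as follows. This is a minimal illustration and not the authors' method: it assumes a single Gaussian reward model per arm with known noise variance and a conjugate Normal prior on the mean, which is the kind of fixed parametric choice the paper replaces with a Bayesian nonparametric Gaussian mixture. The function name `gaussian_thompson_sampling` and all of its parameters are hypothetical, chosen only for this example.

```python
import math
import random


def gaussian_thompson_sampling(true_means, horizon, noise_std=1.0,
                               prior_mean=0.0, prior_var=10.0, seed=0):
    """Thompson sampling with one Gaussian reward model per arm.

    Assumption (not from the paper): rewards are Gaussian with known
    noise_std, and each arm's mean has a Normal(prior_mean, prior_var) prior.
    Returns (cumulative_regret, counts).
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms          # number of pulls per arm
    reward_sums = [0.0] * n_arms   # sum of observed rewards per arm
    best_mean = max(true_means)
    noise_var = noise_std ** 2
    cumulative_regret = []
    total_regret = 0.0

    for _ in range(horizon):
        sampled_means = []
        for k in range(n_arms):
            # Conjugate Normal posterior over the mean reward of arm k.
            post_var = 1.0 / (1.0 / prior_var + counts[k] / noise_var)
            post_mean = post_var * (prior_mean / prior_var
                                    + reward_sums[k] / noise_var)
            # Draw one plausible mean from the posterior.
            sampled_means.append(rng.gauss(post_mean, math.sqrt(post_var)))

        # Play the arm whose sampled mean is largest.
        arm = max(range(n_arms), key=lambda k: sampled_means[k])
        reward = rng.gauss(true_means[arm], noise_std)

        # Update sufficient statistics and the (pseudo-)regret.
        counts[arm] += 1
        reward_sums[arm] += reward
        total_regret += best_mean - true_means[arm]
        cumulative_regret.append(total_regret)

    return cumulative_regret, counts


if __name__ == "__main__":
    regret, pulls = gaussian_thompson_sampling([0.0, 0.5, 1.0], horizon=2000)
    print("pulls per arm:", pulls)
    print("final cumulative regret:", round(regret[-1], 2))
```

If the true rewards are multimodal, heavy-tailed, or contaminated by outliers, the single-Gaussian posterior in this sketch is misspecified. That is the setting in which the abstract argues a per-arm nonparametric mixture is useful.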
