LG ST ML · Feb 3, 2023

Optimality of Thompson Sampling with Noninformative Priors for Pareto Bandits

Jongyeong Lee, Junya Honda, Chao-Kai Chiang, Masashi Sugiyama

arXiv:2302.01544v16.64 citations · h-index: 86

Originality: Incremental advance

AI Analysis

This work addresses a theoretical gap in the optimality of Thompson sampling (TS) for heavy-tailed bandit models. The advance is incremental but relevant to practitioners in reinforcement learning and decision-making under uncertainty.

The paper studies the optimality of Thompson sampling (TS) for Pareto bandits, a heavy-tailed model with two unknown parameters. It proves that TS with certain probability matching priors achieves the optimal regret bound, whereas TS with the Jeffreys or reference priors is suboptimal unless a truncation procedure is used.

In the stochastic multi-armed bandit problem, a randomized probability matching policy called Thompson sampling (TS) has shown excellent performance in various reward models. In addition to the empirical performance, TS has been shown to achieve asymptotic problem-dependent lower bounds in several models. However, its optimality has been mainly addressed under light-tailed or one-parameter models that belong to exponential families. In this paper, we consider the optimality of TS for the Pareto model that has a heavy tail and is parameterized by two unknown parameters. Specifically, we discuss the optimality of TS with probability matching priors that include the Jeffreys prior and the reference priors. We first prove that TS with certain probability matching priors can achieve the optimal regret bound. Then, we show the suboptimality of TS with other priors, including the Jeffreys and the reference priors. Nevertheless, we find that TS with the Jeffreys and reference priors can achieve the asymptotic lower bound if one uses a truncation procedure. These results suggest carefully choosing noninformative priors to avoid suboptimality and show the effectiveness of truncation procedures in TS-based policies.
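To make the setting in the abstract concrete, here is a minimal sketch of Thompson sampling on a Pareto bandit. It is an illustration under stated assumptions, not the authors' algorithm. It assumes each arm has density f(x) = α κ^α / x^(α+1) for x ≥ κ and a prior of the form π(κ, α) ∝ α^(-k) / κ; this prior family and the exponent `k` are assumptions for the example and are not taken from the text above. Under that assumption the posterior of α is a gamma distribution with shape n - k and rate Σ log(x_i / min x), and κ given α can be drawn by inverse-transform sampling. The function names, the initial round-robin phase, and the example arms are hypothetical, and the truncation procedure mentioned in the abstract is not implemented.

```python
import math
import random


def sample_posterior_mean(rewards, k, rng):
    """Draw (kappa, alpha) from the posterior and return the implied mean reward.

    Assumes the prior pi(kappa, alpha) ∝ alpha**(-k) / kappa (an assumption
    made for this sketch) and at least max(2, k + 1) distinct observations.
    """
    n = len(rewards)
    m = min(rewards)
    s = sum(math.log(x / m) for x in rewards)  # > 0 when rewards are not all equal
    # alpha | data ~ Gamma(shape = n - k, rate = s); gammavariate takes a scale.
    alpha = rng.gammavariate(n - k, 1.0 / s)
    if alpha <= 1.0:
        return math.inf  # a Pareto distribution with alpha <= 1 has infinite mean
    # kappa | alpha, data has density ∝ kappa**(n*alpha - 1) on (0, m].
    kappa = m * rng.random() ** (1.0 / (n * alpha))
    return alpha * kappa / (alpha - 1.0)


def run_ts(arms, horizon, k=1, seed=0):
    """Run Thompson sampling on Pareto arms given as (kappa, alpha) pairs.

    Returns the list of pulled arm indices.
    """
    rng = random.Random(seed)
    n_init = max(2, k + 1)  # initial pulls so every posterior is proper
    history = [[] for _ in arms]
    pulls = []
    for t in range(horizon):
        if t < n_init * len(arms):
            i = t % len(arms)  # round-robin initialisation
        else:
            samples = [sample_posterior_mean(h, k, rng) for h in history]
            i = max(range(len(arms)), key=samples.__getitem__)
        kappa, alpha = arms[i]
        # paretovariate(alpha) has scale 1; multiply by kappa to rescale.
        history[i].append(kappa * rng.paretovariate(alpha))
        pulls.append(i)
    return pulls


if __name__ == "__main__":
    # Hypothetical two-armed instance: expected rewards are 1.5 and 3.0.
    pulls = run_ts([(1.0, 3.0), (1.0, 1.5)], horizon=2000, k=1, seed=0)
    print("pull counts:", [pulls.count(i) for i in range(2)])
```

Returning an infinite sampled mean whenever the sampled α is at most 1 is what makes the heavy tail matter here: an arm with few observations can be selected because its posterior still gives weight to infinite-mean models. How often that happens depends on the prior exponent, which is one way to read the abstract's point that the choice of noninformative prior affects regret.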
