Chained Information-Theoretic bounds and Tight Regret Rate for Linear Bandit Problems
This work provides incremental improvements in regret bounds for linear bandit algorithms, benefiting researchers in reinforcement learning and optimization.
The paper tackles the Bayesian regret of a Thompson-Sampling variant for linear bandit problems, establishing new bounds based on metric entropy that achieve a tight regret rate of O(d√T) for d-dimensional settings.
This paper studies the Bayesian regret of a variant of the Thompson-Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [Russo and Van Roy, 2015] and, more specifically, on the rate-distortion analysis from [Dong and Van Roy, 2020], where they proved a bound with regret rate of $O(d\sqrt{T \log(T)})$ for the $d$-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds that depend on the metric entropy of the action space for a variant of Thompson-Sampling. Under suitable continuity assumption of the rewards, our bound offers a tight rate of $O(d\sqrt{T})$ for $d$-dimensional linear bandit problems.