Improved Bayesian Regret Bounds for Thompson Sampling in Reinforcement Learning
This work provides foundational theoretical guarantees for Thompson Sampling in RL, addressing a key problem for researchers and practitioners in reinforcement learning and decision-making under uncertainty.
The paper tackles the problem of analyzing Bayesian regret bounds for Thompson Sampling in reinforcement learning, proving the first such bounds in multiple settings and achieving an upper bound of order O~(H√(d_l1 T)), with concrete bounds derived for tabular, linear, and finite mixture environments.
In this paper, we prove the first Bayesian regret bounds for Thompson Sampling in reinforcement learning in a multitude of settings. We simplify the learning problem using a discrete set of surrogate environments, and present a refined analysis of the information ratio using posterior consistency. This leads to an upper bound of order $\widetilde{O}(H\sqrt{d_{l_1}T})$ in the time inhomogeneous reinforcement learning problem where $H$ is the episode length and $d_{l_1}$ is the Kolmogorov $l_1-$dimension of the space of environments. We then find concrete bounds of $d_{l_1}$ in a variety of settings, such as tabular, linear and finite mixtures, and discuss how how our results are either the first of their kind or improve the state-of-the-art.