When and why randomised exploration works (in linear bandits)
This provides a theoretical foundation for understanding when randomized exploration works in linear bandits, addressing a key challenge in reinforcement learning and bandit algorithms.
The paper tackles the problem of analyzing randomized exploration algorithms like Thompson sampling in linear bandits, showing that under smooth and strongly convex action spaces, they achieve an n-step regret bound of O(d√n log(n)), which is the first proof of optimal dimension dependence for Thompson sampling in such settings.
We provide an approach for the analysis of randomised exploration algorithms like Thompson sampling that does not rely on forced optimism or posterior inflation. With this, we demonstrate that in the $d$-dimensional linear bandit setting, when the action space is smooth and strongly convex, randomised exploration algorithms enjoy an $n$-step regret bound of the order $O(d\sqrt{n} \log(n))$. Notably, this shows for the first time that there exist non-trivial linear bandit settings where Thompson sampling can achieve optimal dimension dependence in the regret.