Connections Between Mirror Descent, Thompson Sampling and the Information Ratio
This work addresses the theoretical understanding and algorithm design for online learning problems, such as bandits and optimization, by bridging information-theoretic and convex optimization approaches, with incremental improvements in regret bounds.
The paper formalizes the connection between information-theoretic analysis and mirror descent techniques in online learning, showing that existing methods can derive similar bounds, and introduces an efficient algorithm for k-armed adversarial bandits that matches the best information-theoretic upper bound while improving regret guarantees for online linear optimization on ℓ_p-balls and bandits with graph feedback.
The information-theoretic analysis by Russo and Van Roy (2014) in combination with minimax duality has proved a powerful tool for the analysis of online learning algorithms in full and partial information settings. In most applications there is a tantalising similarity to the classical analysis based on mirror descent. We make a formal connection, showing that the information-theoretic bounds in most applications can be derived from existing techniques for online convex optimisation. Besides this, for $k$-armed adversarial bandits we provide an efficient algorithm with regret that matches the best information-theoretic upper bound and improve best known regret guarantees for online linear optimisation on $\ell_p$-balls and bandits with graph feedback.