Improved Analysis of UCRL2 with Empirical Bernstein Inequality
This work provides an improved theoretical analysis for reinforcement learning algorithms, addressing regret bounds in MDPs, but it appears incremental as it builds on existing UCRL2 methods.
The paper tackles the exploration-exploitation problem in communicating Markov Decision Processes by analyzing UCRL2 with Empirical Bernstein inequalities, achieving a regret bound of $\widetilde{O}(\sqrt{D\Gamma S A T})$ for MDPs with $S$ states, $A$ actions, $\Gamma \leq S$ next states, and diameter $D$.
We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $Γ\leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{DΓS A T})$.