MLLGMay 16, 2022

From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

arXiv:2205.07704v226 citationsh-index: 68
Originality Highly original
AI Analysis

This addresses the challenge of efficient exploration in RL for researchers and practitioners, offering a simpler alternative to existing methods, though it is incremental as it builds on prior bandit algorithms.

The paper tackles the problem of optimistic exploration in reinforcement learning for tabular, episodic Markov decision processes by proposing the Bayes-UCBVI algorithm, which achieves a regret bound of order Õ(√(H³SAT)), matching the lower bound up to logarithmic terms and providing the first optimal dependence on horizon H without complex bonuses.

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision process: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order $\widetilde{O}(\sqrt{H^3SAT})$ where $H$ is the length of one episode, $S$ is the number of states, $A$ the number of actions, $T$ the number of episodes, that matches the lower-bound of $Ω(\sqrt{H^3SAT})$ up to poly-$\log$ terms in $H,S,A,T$ for a large enough $T$. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon $H$ (and $S$) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum that can be of independent interest. We then explain how Bayes-UCBVI can be easily extended beyond the tabular setting, exhibiting a strong link between our algorithm and Bayesian bootstrap (Rubin, 1981).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes