LGFeb 19, 2021

Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Zhihan Xiong, Ruoqi Shen, Qiwen Cui, Maryam Fazel, Simon S. Du

arXiv:2102.09703v512.511 citationsh-index: 32

Originality Highly original

AI Analysis

This work provides a near-optimal solution for reinforcement learning practitioners dealing with exploration challenges in episodic MDPs, showing that randomized methods can match optimistic algorithms, which is a significant theoretical advancement.

The paper tackles the problem of exploration in reinforcement learning for tabular Markov Decision Processes by proposing a randomized value function algorithm with a single random seed and Bernstein-type noise, achieving a worst-case regret bound of $\widetilde{O}\left(H\sqrt{SAT} ight)$ that matches the lower bound up to logarithmic factors.

We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithms enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Process where $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the $Ω\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.

View on arXiv PDF

Similar