LG MLSep 9, 2020

Improved Exploration in Factored Average-Reward MDPs

Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard

arXiv:2009.04575v35.09 citationsh-index: 24

Originality Highly original

AI Analysis

This work addresses the problem of efficient exploration in complex factored MDPs for reinforcement learning practitioners, offering incremental improvements over prior regret bounds.

The paper tackles regret minimization in unknown Factored Markov Decision Processes (FMDPs) under the average-reward criterion by introducing DBN-UCRL, a novel strategy that uses Bernstein-type confidence sets for transitions, achieving a regret bound with improved dependencies on state sizes and diameter terms compared to existing methods, and demonstrating substantial empirical regret improvements in experiments.

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state-space $\mathcal S$ admit the respective factored forms of $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$, and the transition and reward functions are factored over $\mathcal X$ and $\mathcal S$. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of $\mathcal S_i$'s and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys substantially improved regret empirically over existing algorithms that have frequentist regret guarantees.

View on arXiv PDF

Similar