LGMLJun 6, 2020

Model-Free Reinforcement Learning: from Clipped Pseudo-Regret to Sample Complexity

arXiv:2006.03864v340 citations
Originality Highly original
AI Analysis

This addresses the sample efficiency challenge in reinforcement learning for researchers and practitioners, offering near-optimal theoretical guarantees, though it is incremental in improving existing bounds.

The paper tackles the problem of learning an ε-optimal policy for discounted Markov Decision Processes, providing a model-free algorithm with sample complexity Õ(SA/(ε²(1-γ)^5.5)) and an improved version for small ε with complexity Õ(SA/(ε²(1-γ)^3)), matching the information-theoretic lower bound up to logarithmic factors.

In this paper we consider the problem of learning an $ε$-optimal policy for a discounted Markov Decision Process (MDP). Given an MDP with $S$ states, $A$ actions, the discount factor $γ\in (0,1)$, and an approximation threshold $ε> 0$, we provide a model-free algorithm to learn an $ε$-optimal policy with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{ε^2(1-γ)^{5.5}})$ (where the notation $\tilde{O}(\cdot)$ hides poly-logarithmic factors of $S,A,1/(1-γ)$, and $1/ε$) and success probability $(1-p)$. For small enough $ε$, we show an improved algorithm with sample complexity $\tilde{O}(\frac{SA\ln(1/p)}{ε^2(1-γ)^{3}})$. While the first bound improves upon all known model-free algorithms and model-based ones with tight dependence on $S$, our second algorithm beats all known sample complexity bounds and matches the information theoretic lower bound up to logarithmic factors.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes