LG OC MLJun 11, 2019

Variance-reduced $Q$-learning is minimax optimal

arXiv:1906.04697v225.8107 citations

Originality Incremental advance

AI Analysis

This provides a near-optimal algorithm for reinforcement learning practitioners dealing with sample efficiency in finite MDPs, though it is incremental as it builds on variance reduction techniques.

The paper tackles the problem of sample-efficient Q-learning in discounted MDPs by introducing a variance-reduced variant, achieving an ε-accurate estimate with sample complexity that matches minimax lower bounds up to a logarithmic factor, specifically O(D/(ε^2(1-γ)^3) * log(D/(1-γ))), compared to worse scaling in ordinary Q-learning.

We introduce and analyze a form of variance-reduced $Q$-learning. For $γ$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $ε$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O} \left(\left(\frac{D}{ ε^2 (1-γ)^3} \right) \; \log \left( \frac{D}{(1-γ)} \right) \right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.

View on arXiv PDF

Similar