LGOCMLJun 11, 2019

Variance-reduced $Q$-learning is minimax optimal

arXiv:1906.04697v2107 citations
AI Analysis

This provides a near-optimal algorithm for reinforcement learning practitioners dealing with sample efficiency in finite MDPs, though it is incremental as it builds on variance reduction techniques.

The paper tackles the problem of sample-efficient Q-learning in discounted MDPs by introducing a variance-reduced variant, achieving an ε-accurate estimate with sample complexity that matches minimax lower bounds up to a logarithmic factor, specifically O(D/(ε^2(1-γ)^3) * log(D/(1-γ))), compared to worse scaling in ordinary Q-learning.

We introduce and analyze a form of variance-reduced $Q$-learning. For $γ$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $ε$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O} \left(\left(\frac{D}{ ε^2 (1-γ)^3} \right) \; \log \left( \frac{D}{(1-γ)} \right) \right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes