MLLGDec 29, 2021

A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

arXiv:2112.14582v428 citations
Originality Incremental advance
AI Analysis

This work provides theoretical guarantees for online inference in reinforcement learning, which is incremental but offers rigorous statistical foundations for practitioners.

The paper tackles the problem of analyzing Q-learning with Polyak-Ruppert averaging in discounted Markov decision processes, establishing a functional central limit theorem and showing that the averaged estimator is regular asymptotically linear with efficient influence, while providing nonasymptotic error bounds that match instance-dependent lower bounds.

We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes