ML LGDec 29, 2021

A Statistical Analysis of Polyak-Ruppert Averaged Q-learning

Xiang Li, Wenhao Yang, Jiadong Liang, Zhihua Zhang, Michael I. Jordan

arXiv:2112.14582v416.928 citationsHas Code

Originality Incremental advance

AI Analysis

This work provides theoretical guarantees for online inference in reinforcement learning, which is incremental but offers rigorous statistical foundations for practitioners.

The paper tackles the problem of analyzing Q-learning with Polyak-Ruppert averaging in discounted Markov decision processes, establishing a functional central limit theorem and showing that the averaged estimator is regular asymptotically linear with efficient influence, while providing nonasymptotic error bounds that match instance-dependent lower bounds.

We study Q-learning with Polyak-Ruppert averaging in a discounted Markov decision process in synchronous and tabular settings. Under a Lipschitz condition, we establish a functional central limit theorem for the averaged iteration $\bar{\boldsymbol{Q}}_T$ and show that its standardized partial-sum process converges weakly to a rescaled Brownian motion. The functional central limit theorem implies a fully online inference method for reinforcement learning. Furthermore, we show that $\bar{\boldsymbol{Q}}_T$ is the regular asymptotically linear (RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ that has the most efficient influence function. We present a nonasymptotic analysis for the $\ell_{\infty}$ error, $\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing that it matches the instance-dependent lower bound for polynomial step sizes. Similar results are provided for entropy-regularized Q-learning without the Lipschitz condition.

View on arXiv PDF Code

Similar