LG OC MLFeb 2, 2021

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, Karthikeyan Shanmugam

arXiv:2102.01567v426.370 citations

Originality Highly original

AI Analysis

This work provides theoretical convergence guarantees for asynchronous reinforcement learning algorithms, which is crucial for understanding their reliability and performance for practitioners and researchers in RL.

This paper establishes finite-sample mean-square convergence bounds for a broad range of asynchronous reinforcement learning algorithms, including Q-learning, n-step TD, TD(λ), and off-policy TD algorithms like V-trace. It achieves this by reformulating these algorithms as Markovian Stochastic Approximation algorithms and applying a Lyapunov analysis.

This paper develops an unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as \textit{Markovian Stochastic Approximation} (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(λ)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(λ)$, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL. This was first posed as an open problem in (Sutton, 1999).

View on arXiv PDF

Similar