LG MLOct 27, 2020

Temporal Difference Learning as Gradient Splitting

arXiv:2010.14657v112.020 citations

Originality Incremental advance

AI Analysis

This work provides a theoretical improvement for reinforcement learning practitioners by offering a new interpretation that simplifies convergence analysis and yields better finite-time bounds, though it is incremental as it builds on existing TD methods.

The paper tackles the problem of understanding and improving the convergence of temporal difference (TD) learning with linear function approximation in Markov Decision Processes by reinterpreting it as gradient splitting, which allows applying gradient descent convergence proofs. This leads to a minor variation on TD learning that reduces the convergence time bound, specifically eliminating the multiplicative factor 1/(1-γ) from the dominant term, where γ is the discount factor.

Temporal difference learning with linear function approximation is a popular method to obtain a low-dimensional approximation of the value function of a policy in a Markov Decision Process. We give a new interpretation of this method in terms of a splitting of the gradient of an appropriately chosen function. As a consequence of this interpretation, convergence proofs for gradient descent can be applied almost verbatim to temporal difference learning. Beyond giving a new, fuller explanation of why temporal difference works, our interpretation also yields improved convergence times. We consider the setting with $1/\sqrt{T}$ step-size, where previous comparable finite-time convergence time bounds for temporal difference learning had the multiplicative factor $1/(1-γ)$ in front of the bound, with $γ$ being the discount factor. We show that a minor variation on TD learning which estimates the mean of the value function separately has a convergence time where $1/(1-γ)$ only multiplies an asymptotically negligible term.

View on arXiv PDF

Similar