AI LG MLFeb 9, 2018

A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Long Yang, Minhao Shi, Qian Zheng, Wenjia Meng, Gang Pan

arXiv:1802.03171v110.124 citations

Originality Incremental advance

AI Analysis

This work addresses memory and computation bottlenecks in reinforcement learning for practitioners, offering an incremental improvement by unifying existing algorithms.

The paper tackles the inefficiency of multi-step temporal-difference learning algorithms like Q(σ) by combining it with eligibility traces to propose Q(σ,λ), which unifies Sarsa(λ) and Q^π(λ). Results show that with an intermediate σ, Q(σ,λ) learns the optimal value significantly faster than extreme cases, converging exponentially to the optimal value function.

Recently, a new multi-step temporal learning algorithm, called $Q(σ)$, unifies $n$-step Tree-Backup (when $σ=0$) and $n$-step Sarsa (when $σ=1$) by introducing a sampling parameter $σ$. However, similar to other multi-step temporal-difference learning algorithms, $Q(σ)$ needs much memory consumption and computation time. Eligibility trace is an important mechanism to transform the off-line updates into efficient on-line ones which consume less memory and computation time. In this paper, we further develop the original $Q(σ)$, combine it with eligibility traces and propose a new algorithm, called $Q(σ,λ)$, in which $λ$ is trace-decay parameter. This idea unifies Sarsa$(λ)$ (when $σ=1$) and $Q^π(λ)$ (when $σ=0$). Furthermore, we give an upper error bound of $Q(σ,λ)$ policy evaluation algorithm. We prove that $Q(σ,λ)$ control algorithm can converge to the optimal value function exponentially. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of $σ$, $Q(σ,λ)$ creates a mixture of the existing algorithms that can learn the optimal value significantly faster than the extreme end ($σ=0$, or $1$).

View on arXiv PDF

Similar