Minhao Shi

1paper

1 Paper

AIFeb 9, 2018
A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Long Yang, Minhao Shi, Qian Zheng et al.

Recently, a new multi-step temporal learning algorithm, called $Q(σ)$, unifies $n$-step Tree-Backup (when $σ=0$) and $n$-step Sarsa (when $σ=1$) by introducing a sampling parameter $σ$. However, similar to other multi-step temporal-difference learning algorithms, $Q(σ)$ needs much memory consumption and computation time. Eligibility trace is an important mechanism to transform the off-line updates into efficient on-line ones which consume less memory and computation time. In this paper, we further develop the original $Q(σ)$, combine it with eligibility traces and propose a new algorithm, called $Q(σ,λ)$, in which $λ$ is trace-decay parameter. This idea unifies Sarsa$(λ)$ (when $σ=1$) and $Q^π(λ)$ (when $σ=0$). Furthermore, we give an upper error bound of $Q(σ,λ)$ policy evaluation algorithm. We prove that $Q(σ,λ)$ control algorithm can converge to the optimal value function exponentially. We also empirically compare it with conventional temporal-difference learning methods. Results show that, with an intermediate value of $σ$, $Q(σ,λ)$ creates a mixture of the existing algorithms that can learn the optimal value significantly faster than the extreme end ($σ=0$, or $1$).