Schedule Based Temporal Difference Algorithms
This work addresses a limitation of reinforcement learning algorithms for policy evaluation by offering more control over the weights assigned to n-step returns. The contribution is incremental, as it builds directly on existing TD methods.
The paper tackles the problem of learning value functions in reinforcement learning. It introduces a λ-schedule procedure that generalizes TD(λ) to time-varying λ parameters, which allows flexible weight assignment for n-step returns. It proposes three algorithms and proves almost sure convergence for each under a Markov noise framework.
Learning the value function of a given policy from data samples is an important problem in Reinforcement Learning. TD($λ$) is a popular class of algorithms for solving this problem. However, the weights assigned to different $n$-step returns in TD($λ$), controlled by the parameter $λ$, decrease exponentially with increasing $n$. In this paper, we present a $λ$-schedule procedure that generalizes the TD($λ$) algorithm to the case where the parameter $λ$ may vary with the time step. This allows flexibility in weight assignment, i.e., the user can specify the weights assigned to different $n$-step returns by choosing a sequence $\{λ_t\}_{t \geq 1}$. Based on this procedure, we propose an on-policy algorithm, TD($λ$)-schedule, and two off-policy algorithms, GTD($λ$)-schedule and TDC($λ$)-schedule. We provide proofs of almost sure convergence for all three algorithms under a general Markov noise framework.
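To make the link between a schedule $\{λ_t\}$ and the weights on $n$-step returns concrete, the following Python sketch may help. It assumes one natural parametrization that is not stated in the text above: the $n$-step return receives weight $w_n = (1 - λ_n) \prod_{j<n} λ_j$, which reduces to the familiar TD($λ$) weight $(1 - λ) λ^{n-1}$ when the schedule is constant. The function names `schedule_weights` and `lambdas_from_weights` are hypothetical and chosen only for illustration; the paper's own construction may differ in detail.

```python
def schedule_weights(lambdas):
    """Weights on the 1-step, 2-step, ... returns for a given lambda schedule.

    ASSUMPTION (not from the source text): w_n = (1 - lambda_n) * prod_{j<n} lambda_j.
    Returns the list of weights and the leftover mass that would fall on
    returns longer than len(lambdas).
    """
    weights = []
    carry = 1.0  # running product lambda_1 * ... * lambda_{n-1}
    for lam in lambdas:
        weights.append((1.0 - lam) * carry)
        carry *= lam
    return weights, carry


def lambdas_from_weights(weights):
    """Invert the assumed parametrization: find the schedule that yields
    the desired weights. Each weight must fit in the mass left over
    after the earlier ones."""
    lambdas = []
    remaining = 1.0  # mass not yet assigned to shorter returns
    for w in weights:
        if remaining <= 0.0:
            raise ValueError("weights exceed total mass of 1")
        lambdas.append(1.0 - w / remaining)
        remaining -= w
    return lambdas


if __name__ == "__main__":
    # A constant schedule recovers the exponentially decaying TD(lambda) weights.
    print(schedule_weights([0.5, 0.5, 0.5]))
    # Equal weight on the first four n-step returns needs a decreasing schedule.
    print(lambdas_from_weights([0.25, 0.25, 0.25, 0.25]))
```

Under this assumed parametrization, a constant schedule can only produce exponentially decaying weights, while a time-varying schedule can express profiles such as uniform weighting over a finite horizon, which is the kind of flexibility the abstract describes.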