Q($λ$) with Off-Policy Corrections
This work addresses off-policy learning in reinforcement learning, offering an incremental method that comes with theoretical convergence guarantees.
The paper tackles off-policy multi-step temporal difference learning by proposing an approach that corrects off-policy returns in terms of rewards, using the current Q-function, rather than in terms of transition probabilities, using the target policy. It proves that these approximate corrections suffice for convergence in both policy evaluation and control, under conditions that relate the distance between the target and behavior policies, the eligibility trace parameter, and the discount factor. This relationship is illustrated empirically on a continuous-state control task.
We propose and analyze an alternative approach to off-policy multi-step temporal difference learning, in which off-policy returns are corrected with the current Q-function in terms of rewards, rather than with the target policy in terms of transition probabilities. We prove that such approximate corrections are sufficient for off-policy convergence both in policy evaluation and control, provided certain conditions hold. These conditions relate the distance between the target and behavior policies, the eligibility trace parameter and the discount factor, and formalize an underlying tradeoff in off-policy TD($λ$). We illustrate this theoretical relationship empirically on a continuous-state control task.
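To make the idea of correcting returns "in terms of rewards" concrete, the following is a minimal tabular sketch, not the authors' implementation. It assumes the correction takes the form of a $(γλ)^t$-weighted sum of expected TD errors, $δ_t = r_t + γ \, E_π Q(x_{t+1}, ·) - Q(x_t, a_t)$, with the expectation taken under the target policy and no importance-sampling ratios. The function name, argument names, and data layout are all hypothetical.

```python
def q_lambda_correction(q, pi, trajectory, gamma, lam):
    """Sum of (gamma * lam)^t-weighted expected TD errors along a trajectory.

    Assumed (hypothetical) data layout:
      q:          dict mapping state -> list of action values
      pi:         dict mapping state -> list of target-policy probabilities
      trajectory: list of (state, action, reward, next_state) tuples,
                  generated by any behavior policy
    The returned value is the increment for Q at the first (state, action).
    """
    total = 0.0
    weight = 1.0  # (gamma * lam)^t, starting at t = 0
    for state, action, reward, next_state in trajectory:
        # Expectation of Q at the next state under the target policy.
        expected_next = sum(p * v for p, v in zip(pi[next_state], q[next_state]))
        # Expected TD error: the reward-based correction, with no
        # importance-sampling ratio on the behavior policy's action choice.
        delta = reward + gamma * expected_next - q[state][action]
        total += weight * delta
        weight *= gamma * lam
    return total
```

With `lam = 0` the sum reduces to the one-step expected TD error. Larger values of `lam` give more weight to later, uncorrected off-policy steps, which is where the tradeoff between the eligibility trace parameter, the discount factor, and the policy distance arises.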