Should All Temporal Difference Learning Use Emphasis?
This work addresses convergence problems in reinforcement learning and is relevant to both researchers and practitioners, but its contribution is incremental: it builds on prior work that already suggested ETD as a substitute for TD.
The paper addresses convergence issues in Temporal Difference (TD) learning by showing empirically that Emphatic Temporal Difference (ETD) learning converges on on-policy experiments where TD diverges or performs poorly, and that ETD outperforms TD on the mountain car prediction problem.
Emphatic Temporal Difference (ETD) learning has recently been proposed as a convergent off-policy learning method. ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training, but it differs from conventional TD learning even under on-policy training. A simple counterexample provided in 2017 pointed to a potential class of problems where ETD converges but TD diverges. In this paper, we empirically show that ETD converges on a few other well-known on-policy experiments where TD either diverges or performs poorly. We also show that ETD outperforms TD on the mountain car prediction problem. Our results, together with a similar pattern observed under off-policy training in prior work, suggest that ETD might be a good substitute for conventional TD.
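To illustrate why ETD differs from TD even under on-policy training, the following is a minimal sketch of a single update step for each method. It is not the paper's code. It assumes linear function approximation and the one-step case (lambda = 0), neither of which is stated in the abstract, and it uses the standard ETD followon trace F. The function names (`td0_update`, `etd0_update`) and all parameter names are hypothetical. With importance ratio rho = 1 and uniform interest 1 (the on-policy setting), the only difference is that ETD scales each update by F, which accumulates discounted interest over time.

```python
def dot(a, b):
    """Inner product of two equal-length feature/weight lists."""
    return sum(x * y for x, y in zip(a, b))


def td0_update(theta, x, reward, x_next, alpha, gamma):
    """One linear TD(0) step; returns the new weight vector."""
    delta = reward + gamma * dot(theta, x_next) - dot(theta, x)  # TD error
    return [w + alpha * delta * xi for w, xi in zip(theta, x)]


def etd0_update(theta, followon, x, reward, x_next, alpha, gamma,
                rho=1.0, next_interest=1.0):
    """One linear ETD(0) step (assumed form, lambda = 0).

    followon is F_t for the current state (F_0 is the interest of the
    first state). Returns the new weights and F_{t+1}.
    """
    delta = reward + gamma * dot(theta, x_next) - dot(theta, x)  # same TD error
    # With lambda = 0 the emphasis equals the followon trace F_t.
    new_theta = [w + alpha * rho * followon * delta * xi
                 for w, xi in zip(theta, x)]
    # F_{t+1} = gamma * rho_t * F_t + i(S_{t+1})
    next_followon = gamma * rho * followon + next_interest
    return new_theta, next_followon


# Two on-policy steps on a toy two-feature problem (made-up numbers).
theta_td = td0_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], 0.1, 0.9)
theta_etd, F = etd0_update([0.0, 0.0], 1.0, [1.0, 0.0], 1.0, [0.0, 1.0], 0.1, 0.9)
theta_etd2, F2 = etd0_update(theta_etd, F, [0.0, 1.0], 0.0, [1.0, 0.0], 0.1, 0.9)
```

On the first step the two methods agree, because F starts at the interest of the initial state. From the second step onward F exceeds 1, so ETD weights the same TD error more heavily than TD does.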