PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method
This work addresses a critical bottleneck in reinforcement learning for researchers and practitioners by improving the efficiency of off-policy evaluation, though it is incremental as it builds directly on ETD.
The authors tackled the problem of high variance and exponential sample complexity in Emphatic Temporal Difference (ETD) learning for off-policy value function evaluation by proposing PER-ETD, which uses periodic restarts of the follow-on trace with a logarithmically increasing period, resulting in polynomial sample complexity while converging to the same fixed point as ETD.
Emphatic temporal difference (ETD) learning (Sutton et al., 2016) is a successful method to conduct the off-policy value function evaluation with function approximation. Although ETD has been shown to converge asymptotically to a desirable value function, it is well-known that ETD often encounters a large variance so that its sample complexity can increase exponentially fast with the number of iterations. In this work, we propose a new ETD method, called PER-ETD (i.e., PEriodically Restarted-ETD), which restarts and updates the follow-on trace only for a finite period for each iteration of the evaluation parameter. Further, PER-ETD features a design of the logarithmical increase of the restart period with the number of iterations, which guarantees the best trade-off between the variance and bias and keeps both vanishing sublinearly. We show that PER-ETD converges to the same desirable fixed point as ETD, but improves the exponential sample complexity of ETD to be polynomials. Our experiments validate the superior performance of PER-ETD and its advantage over ETD.