Convergent Tree Backup and Retrace with Function Approximation
This work addresses a key challenge in scaling reinforcement learning, namely stable off-policy learning with function approximation, though the contribution is incremental in that it builds on existing algorithms.
The paper tackles the instability of the Tree Backup and Retrace algorithms under linear function approximation in off-policy reinforcement learning, and derives stable gradient-based algorithms with convergence guarantees and finite-sample bounds.
Off-policy learning is key to scaling up reinforcement learning, as it allows learning about a target policy from experience generated by a different behavior policy. Unfortunately, it has been challenging to combine off-policy learning with function approximation and multi-step bootstrapping in a way that leads to both stable and efficient algorithms. In this work, we show that the \textsc{Tree Backup} and \textsc{Retrace} algorithms are unstable with linear function approximation, both in theory and in practice with specific examples. Based on our analysis, we then derive stable and efficient gradient-based algorithms using a quadratic convex-concave saddle-point formulation. By exploiting the problem structure specific to these algorithms, we are able to provide convergence guarantees and finite-sample bounds. The applicability of our new analysis also goes beyond \textsc{Tree Backup} and \textsc{Retrace} and allows us to provide new convergence rates for the GTD and GTD2 algorithms without resorting to projections or Polyak averaging.
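To make the idea of a quadratic convex-concave saddle-point formulation concrete, the sketch below runs plain gradient descent-ascent on an objective of the form L(theta, omega) = <b - A theta, omega> - 0.5 * omega^T M omega, which is the form commonly used in the GTD literature. It is an illustration under stated assumptions, not the algorithm derived in the paper: the matrices `A`, `M` and the vector `b` are hypothetical numbers chosen for the example, the updates are deterministic (no sampling, no eligibility traces), and the function name `saddle_point_gda` is invented here. When `A` is nonsingular and `M` is positive definite, the saddle point satisfies A theta = b.

```python
import numpy as np


def saddle_point_gda(A, b, M, step=0.1, iters=2000):
    """Gradient descent in theta and ascent in omega on
    L(theta, omega) = <b - A theta, omega> - 0.5 * omega^T M omega.

    Illustrative only: A, b, M are given exactly here, whereas a
    stochastic algorithm would use noisy sample estimates of them.
    """
    theta = np.zeros(A.shape[1])  # primal variable (value-function weights)
    omega = np.zeros(A.shape[0])  # dual variable
    for _ in range(iters):
        grad_theta = -A.T @ omega               # dL/dtheta
        grad_omega = b - A @ theta - M @ omega  # dL/domega
        # Simultaneous update: descend in theta, ascend in omega.
        theta, omega = theta - step * grad_theta, omega + step * grad_omega
    return theta, omega


# Hypothetical 2x2 problem (numbers are assumptions for illustration).
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])
M = np.eye(2)

theta, omega = saddle_point_gda(A, b, M)
print("theta from descent-ascent:", theta)
print("direct solve of A theta = b:", np.linalg.solve(A, b))
```

In this toy setting the iterates can be compared against a direct linear solve. In an actual off-policy algorithm, `A`, `b` and `M` would be expectations estimated from sampled transitions, and the step size would need to respect the conditions under which the convergence results hold.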