Stability and Sensitivity Analysis of Relative Temporal-Difference Learning: Extended Version
Provides theoretical stability guarantees for relative TD learning with function approximation, addressing a known bottleneck of TD methods when the discount factor is close to one.
The paper analyzes relative TD learning with linear function approximation and establishes stability conditions. When the baseline is the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. The asymptotic bias and covariance remain uniformly bounded as the discount factor approaches one.
Relative temporal-difference (TD) learning was introduced to mitigate the slow convergence of TD methods when the discount factor approaches one by subtracting a baseline from the temporal-difference update. While this idea has been studied in the tabular setting, stability guarantees with function approximation remain poorly understood. This paper analyzes relative TD learning with linear function approximation. We establish stability conditions for the algorithm and show that the choice of baseline distribution plays a central role. In particular, when the baseline is chosen as the empirical distribution of the state-action process, the algorithm is stable for any non-negative baseline weight and any discount factor. We also provide a sensitivity analysis of the resulting parameter estimates, characterizing both asymptotic bias and covariance. The asymptotic covariance and asymptotic bias are shown to remain uniformly bounded as the discount factor approaches one.
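
To make the baseline-subtraction idea concrete, the following is a minimal sketch of a relative TD(0) update with linear function approximation. It is an illustration under stated assumptions, not the exact recursion analyzed in the paper: the update form, the function name `relative_td0`, the toy two-state deterministic chain, and the one-hot features (the tabular special case of linear approximation) are all hypothetical choices made here. The chain is over states only rather than state-action pairs, for brevity. The empirical baseline is modeled as the running mean of the features observed so far, weighted by `delta`; setting `delta = 0` recovers the standard TD(0) update.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def relative_td0(phi, rewards, next_state, gamma, delta, alpha, n_steps, x0=0):
    """Relative TD(0) sketch with linear features and an empirical baseline.

    Assumed update (illustrative; the paper's exact form may differ):
        theta <- theta + alpha * phi(x) * (r(x) + gamma * phi(x')^T theta
                                           - phi(x)^T theta
                                           - delta * phi_bar^T theta)
    where phi_bar is the running mean of the features visited so far.
    """
    d = len(phi[0])
    theta = [0.0] * d
    phi_bar = [0.0] * d  # empirical mean of observed features
    x = x0
    for n in range(1, n_steps + 1):
        f = phi[x]
        # Incremental update of the empirical feature mean.
        phi_bar = [m + (fi - m) / n for m, fi in zip(phi_bar, f)]
        x_next = next_state[x]
        # Temporal difference with the weighted baseline subtracted.
        td_error = (rewards[x]
                    + gamma * dot(phi[x_next], theta)
                    - dot(f, theta)
                    - delta * dot(phi_bar, theta))
        theta = [t + alpha * td_error * fi for t, fi in zip(theta, f)]
        x = x_next
    return theta


# Toy example (hypothetical): deterministic two-state cycle 0 -> 1 -> 0.
phi = [[1.0, 0.0], [0.0, 1.0]]  # one-hot features
rewards = [1.0, 0.0]
next_state = [1, 0]
gamma = 0.99
theta = relative_td0(phi, rewards, next_state, gamma,
                     delta=1.0, alpha=0.1, n_steps=5000)
```

In this toy chain the discounted values are of order 1/(1 - gamma), roughly 50, whereas the relative estimates stay of order one. Subtracting the baseline shifts both entries by nearly the same constant, so the difference `theta[0] - theta[1]` should still approach the difference of the true values, 1/(1 + gamma).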