Convergence of TD(0) under Polynomial Mixing with Nonlinear Function Approximation
This provides the first rigorous convergence guarantee for vanilla TD(0) under realistic mixing conditions, addressing a fundamental gap in reinforcement learning theory for practitioners dealing with non-i.i.d. data.
The paper tackles the problem of analyzing the finite-sample behavior of TD(0) reinforcement learning under non-i.i.d. data and nonlinear function approximation, showing that with high probability, the error converges at rates matching known i.i.d. bounds, specifically with terms scaling as t^{-β/2} and t^{-ηγ} after O(1/ε^2) iterations.
Temporal Difference Learning (TD(0)) is fundamental in reinforcement learning, yet its finite-sample behavior under non-i.i.d. data and nonlinear approximation remains unknown. We provide the first high-probability, finite-sample analysis of vanilla TD(0) on polynomially mixing Markov data, assuming only Holder continuity and bounded generalized gradients. This breaks with previous work, which often requires subsampling, projections, or instance-dependent step-sizes. Concretely, for mixing exponent $β> 1$, Holder continuity exponent $γ$, and step-size decay rate $η\in (1/2, 1]$, we show that, with high probability, \[ \| θ_t - θ^* \| \leq C(β, γ, η)\, t^{-β/2} + C'(γ, η)\, t^{-ηγ} \] after $t = \mathcal{O}(1/\varepsilon^2)$ iterations. These bounds match the known i.i.d. rates and hold even when initialization is nonstationary. Central to our proof is a novel discrete-time coupling that bypasses geometric ergodicity, yielding the first such guarantee for nonlinear TD(0) under realistic mixing.