MLLGFeb 8, 2025

Convergence of TD(0) under Polynomial Mixing with Nonlinear Function Approximation

arXiv:2502.05706v2
Originality Highly original
AI Analysis

This provides the first rigorous convergence guarantee for vanilla TD(0) under realistic mixing conditions, addressing a fundamental gap in reinforcement learning theory for practitioners dealing with non-i.i.d. data.

The paper tackles the problem of analyzing the finite-sample behavior of TD(0) reinforcement learning under non-i.i.d. data and nonlinear function approximation, showing that with high probability, the error converges at rates matching known i.i.d. bounds, specifically with terms scaling as t^{-β/2} and t^{-ηγ} after O(1/ε^2) iterations.

Temporal Difference Learning (TD(0)) is fundamental in reinforcement learning, yet its finite-sample behavior under non-i.i.d. data and nonlinear approximation remains unknown. We provide the first high-probability, finite-sample analysis of vanilla TD(0) on polynomially mixing Markov data, assuming only Holder continuity and bounded generalized gradients. This breaks with previous work, which often requires subsampling, projections, or instance-dependent step-sizes. Concretely, for mixing exponent $β> 1$, Holder continuity exponent $γ$, and step-size decay rate $η\in (1/2, 1]$, we show that, with high probability, \[ \| θ_t - θ^* \| \leq C(β, γ, η)\, t^{-β/2} + C'(γ, η)\, t^{-ηγ} \] after $t = \mathcal{O}(1/\varepsilon^2)$ iterations. These bounds match the known i.i.d. rates and hold even when initialization is nonstationary. Central to our proof is a novel discrete-time coupling that bypasses geometric ergodicity, yielding the first such guarantee for nonlinear TD(0) under realistic mixing.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes