LG · ML · May 29

Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

Vagul Mahadevan, Claire Chen, Shuze Daniel Liu, Shangtong Zhang

arXiv:2605.31172 · 66.5

Predicted impact: top 29% in LG · last 90 days

Originality: Highly original

AI Analysis

This work provides a more robust theoretical foundation for the convergence of two-timescale stochastic approximation algorithms, which is crucial for the reliable application of methods like actor-critic and TDC in reinforcement learning.

This paper establishes the stability and convergence of two-timescale stochastic approximations (SA) under Markovian noise, a more realistic setting for reinforcement learning (RL) than the i.i.d. noise assumed in earlier analyses. As an application, the authors establish the first almost sure convergence result for temporal difference learning with gradient correction (TDC) with eligibility traces under off-policy learning with linear function approximation.

This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal difference learning with gradient correction (TDC) and actor-critic methods. Previously, the stability (i.e., boundedness) and convergence of two-timescale SA were only established under i.i.d. noise. This work instead establishes the stability and convergence of two-timescale SA under Markovian noise, a setup that is more realistic in RL. Notably, we do not need to use any projection operator and the noise does not need to live in a compact space. Our key technical novelty is to control the fast timescale parameter with the running max of the slow timescale parameter, instead of with the current slow timescale parameter, as most prior works do. As a key application, we establish the first almost sure convergence of TDC with eligibility traces under off-policy learning with linear function approximation.
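To make the setup in the abstract concrete, the following is a minimal sketch of a two-timescale iteration driven by Markovian noise. It is not the paper's algorithm or analysis: the scalar drift functions, the step sizes `1/n` and `1/n**0.6`, the two-state noise chain, and the function name `two_timescale_sa` are all assumptions chosen for illustration. The fast iterate `w` tracks the slow iterate `theta`, the slow iterate is pushed toward the point where `w = 1`, and the running max of `|theta|` is recorded only to show the quantity the abstract refers to.

```python
import random


def two_timescale_sa(num_steps=200_000, stay_prob=0.9, noise_scale=0.5, seed=0):
    """Toy two-timescale stochastic approximation with Markovian noise.

    All modelling choices here are illustrative assumptions, not taken
    from the paper.
    """
    rng = random.Random(seed)
    state = 0                  # state of a symmetric two-state Markov chain
    theta, w = 5.0, -5.0       # slow and fast iterates, no projection applied
    running_max = abs(theta)   # running max of the slow iterate's magnitude

    for n in range(1, num_steps + 1):
        alpha = 1.0 / n        # slow step size
        beta = 1.0 / n ** 0.6  # fast step size, so alpha / beta -> 0

        # The noise depends on the chain state, so consecutive samples are
        # correlated; it averages to zero under the stationary distribution.
        noise = noise_scale if state == 0 else -noise_scale

        # Simultaneous update: both right-hand sides use the old iterates.
        new_w = w + beta * (theta - w + noise)        # fast: w tracks theta
        new_theta = theta + alpha * (1.0 - w + noise)  # slow: drives w toward 1
        theta, w = new_theta, new_w

        running_max = max(running_max, abs(theta))

        # Advance the chain: stay with probability stay_prob, else flip.
        if rng.random() > stay_prob:
            state = 1 - state

    return theta, w, running_max


if __name__ == "__main__":
    theta, w, running_max = two_timescale_sa()
    print(f"theta={theta:.3f}, w={w:.3f}, running max |theta|={running_max:.3f}")
```

In this toy problem both iterates should settle near 1, since the averaged dynamics have their only equilibrium at `theta = w = 1`. The iterates are never projected or clipped, which mirrors the abstract's remark that no projection operator is needed, although a linear scalar example like this one says nothing about the general case.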
