Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation
This provides theoretical guarantees for reinforcement learning methods in complex settings, though it appears incremental as it builds on existing deterministic algorithms.
The paper tackles the convergence of multi-step TD-learning algorithms in the challenging 'deadly triad' scenario with linear function approximation, off-policy learning, and bootstrapping, proving that these algorithms converge to a solution as the sampling horizon n increases sufficiently.
This paper analyzes multi-step temporal difference (TD)-learning algorithms within the ``deadly triad'' scenario, characterized by linear function approximation, off-policy learning, and bootstrapping. In particular, we prove that $n$-step TD-learning algorithms converge to a solution as the sampling horizon $n$ increases sufficiently. The paper is divided into two parts. In the first part, we comprehensively examine the fundamental properties of their model-based deterministic counterparts, including projected value iteration, gradient descent algorithms, which can be viewed as prototype deterministic algorithms whose analysis plays a pivotal role in understanding and developing their model-free reinforcement learning counterparts. In particular, we prove that these algorithms converge to meaningful solutions when $n$ is sufficiently large. Based on these findings, in the second part, two $n$-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.