Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
This work addresses the high-variance issue in off-policy reinforcement learning and offers theoretical guarantees for practical algorithms, though it is incremental in that it builds on existing operator analysis.
The paper derives finite-sample bounds for off-policy TD-learning algorithms by analyzing generalized Bellman operators. It provides the first known guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, improves the bounds for $Q$-trace, and characterizes the bias-variance trade-off in each of these algorithms.
In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g., $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, Retrace$(\lambda)$, and $Q$-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first known finite-sample guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, and improve the best known bounds for $Q$-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.
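The variance issue mentioned above can be illustrated with a small sketch. The per-step trace coefficients used here are the forms commonly given in the off-policy literature (plain importance sampling $\pi/\mu$, $Q^\pi(\lambda)$ with $c=\lambda$, Tree-Backup$(\lambda)$ with $c=\lambda\pi(a|s)$, Retrace$(\lambda)$ with $c=\lambda\min(1,\pi/\mu)$); they are assumptions brought in for illustration and are not stated in this abstract. The single-state setting with i.i.d. actions, the policies `pi` and `mu`, and the helper names are likewise hypothetical simplifications, and the sketch shows only the variance side of the trade-off, not the bias.

```python
def trace_coefficients(pi, mu, lam=0.9):
    """Per-action trace coefficients c(a) at a single state (assumed standard forms)."""
    ratios = [p / m for p, m in zip(pi, mu)]
    return {
        "importance_sampling": ratios,                  # c = pi / mu
        "q_pi_lambda": [lam for _ in ratios],           # c = lambda
        "tree_backup": [lam * p for p in pi],           # c = lambda * pi(a)
        "retrace": [lam * min(1.0, r) for r in ratios], # c = lambda * min(1, pi / mu)
    }


def second_moment_of_product(c, mu, n_steps):
    """E[(c(A_1) * ... * c(A_n))^2] for i.i.d. actions A_i drawn from mu."""
    one_step = sum(m * ci ** 2 for m, ci in zip(mu, c))
    return one_step ** n_steps


# Hypothetical target policy pi and behavior policy mu over two actions.
pi = [0.9, 0.1]
mu = [0.5, 0.5]

moments = {
    name: second_moment_of_product(c, mu, n_steps=10)
    for name, c in trace_coefficients(pi, mu).items()
}

for name, value in moments.items():
    print(f"{name}: {value:.3e}")
```

In this toy setting the second moment of the plain importance sampling product grows with the number of steps, while the other three coefficients are bounded by $\lambda \le 1$ and their products shrink. That reduction in variance is what motivates these algorithms, and it comes at the cost of the bias that the paper's trade-off analysis quantifies.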