MLLGSTNov 7, 2022

Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

arXiv:2211.03899v12 citationsh-index: 100
Originality Incremental advance
AI Analysis

This addresses the problem of efficient policy evaluation in reinforcement learning with limited data, offering theoretical guarantees for practitioners, though it is incremental as it builds on existing TD methods.

The paper tackles non-parametric estimation of value functions in Markov reward processes using single-trajectory data, providing non-asymptotic error bounds for kernel-based multi-step TD methods that show statistical error is similar to i.i.d. data under correct specification but inflated under mis-specification, with increased look-ahead mitigating this deterioration.

We study non-parametric estimation of the value function of an infinite-horizon $γ$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(λ)$ family for $λ\in [0,1)$ as special cases. Our bounds capture its dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes