MLLGAug 14, 2015

Emphatic TD Bellman Operator is a Contraction

arXiv:1508.03411v27 citations
AI Analysis

This provides the first error bounds for off-policy evaluation in reinforcement learning, addressing a theoretical gap for researchers and practitioners in the field, though it is incremental as it builds on existing ETD work.

The paper tackled the problem of proving error bounds for off-policy evaluation algorithms by showing that the emphatic temporal differences (ETD) algorithm involves a contraction operator with a sqrt(gamma)-contraction modulus, enabling the derivation of error bounds for ETD under general target and behavior policies.

Recently, \citet{SuttonMW15} introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator, with a $\sqrtγ$-contraction modulus (where $γ$ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes