ML LGAug 14, 2015

Emphatic TD Bellman Operator is a Contraction

arXiv:1508.03411v28.67 citations

Originality Incremental advance

AI Analysis

This provides the first error bounds for off-policy evaluation in reinforcement learning, addressing a theoretical gap for researchers and practitioners in the field, though it is incremental as it builds on existing ETD work.

The paper tackled the problem of proving error bounds for off-policy evaluation algorithms by showing that the emphatic temporal differences (ETD) algorithm involves a contraction operator with a sqrt(gamma)-contraction modulus, enabling the derivation of error bounds for ETD under general target and behavior policies.

Recently, \citet{SuttonMW15} introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation that underlies ETD involves a contraction operator, with a $\sqrtγ$-contraction modulus (where $γ$ is the discount factor). This allows us to provide error bounds on the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.

View on arXiv PDF

Similar