ML LG

Sep 17, 2015

Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Assaf Hallak, Aviv Tamar, Remi Munos, Shie Mannor

arXiv:1509.05172v221.861 citations

Originality: Incremental advance

AI Analysis

This work addresses off-policy evaluation for reinforcement learning practitioners, offering an incremental improvement by extending existing ETD methods with theoretical guarantees.

The authors address off-policy evaluation in Markov decision processes by generalizing emphatic temporal differences (ETD) with a parameter β that controls the decay rate of an importance-sampling term. Tuning β lets the method trade bias for variance reduction, which can yield a lower total error.

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton et al., 2015), which encompasses the original ETD($λ$), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD($λ$, $β$), where our introduced parameter $β$ controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD($λ$, $β$) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD($λ$, $β$). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling $β$, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
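To make the role of $β$ concrete, here is a minimal sketch of the $λ = 0$ case with linear function approximation. The abstract does not state the update rule, so the follow-on trace recursion used here ($F_n = β ρ_{n-1} F_{n-1} + 1$ with $F_0 = 1$), the function name `etd0_beta`, and all argument names are assumptions made for illustration. Under this assumed form, $β = 0$ keeps $F_n = 1$ and the update reduces to importance-weighted off-policy TD(0), while larger $β$ lets past importance ratios accumulate in $F_n$.

```python
import numpy as np


def etd0_beta(phi, rewards, rho, gamma, beta, alpha):
    """One pass of an ETD(0, beta)-style update over a single trajectory.

    Assumed (hypothetical) interface:
      phi:     (T+1, d) array, feature vectors of states s_0..s_T
      rewards: (T,) array, rewards[n] is received on the step s_n -> s_{n+1}
      rho:     (T,) array, importance ratios pi(a_n|s_n) / mu(a_n|s_n)
      gamma:   discount factor
      beta:    decay rate of the importance-sampling (follow-on) term
      alpha:   step size
    """
    theta = np.zeros(phi.shape[1])
    F = 1.0  # assumed initial follow-on trace, F_0 = 1
    for n in range(len(rewards)):
        if n > 0:
            # Assumed recursion: beta discounts the accumulated ratios.
            F = beta * rho[n - 1] * F + 1.0
        # One-step TD error under the current linear value estimate.
        td_error = rewards[n] + gamma * (phi[n + 1] @ theta) - phi[n] @ theta
        # Emphasis-weighted, importance-corrected semi-gradient step.
        theta = theta + alpha * F * rho[n] * td_error * phi[n]
    return theta


# Toy on-policy check: one state, reward 1, gamma 0.5, so the value is 2.
phi = np.ones((501, 1))
rewards = np.ones(500)
rho = np.ones(500)
theta_td = etd0_beta(phi, rewards, rho, gamma=0.5, beta=0.0, alpha=0.1)
theta_etd = etd0_beta(phi, rewards, rho, gamma=0.5, beta=0.5, alpha=0.1)
```

In this toy setting both choices of `beta` converge to the same value because the data is on-policy; the trade-off described in the abstract only shows up when `rho` varies.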
