LG · Mar 14, 2015

An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning

arXiv:1503.04269v2 · 296 citations
AI Analysis

This paper addresses stable off-policy learning in reinforcement learning, offering researchers and practitioners a simpler, easier-to-use alternative to the gradient-TD methods.

The paper tackles the instability of off-policy temporal-difference learning by introducing emphatic TD(λ), which selectively emphasizes or de-emphasizes updates on different time steps so that the expected update is stable under off-policy training. The resulting method is simpler than the prior gradient-TD methods (TDC, GTD(λ), GQ(λ)): it has only one learned parameter vector and one step-size parameter.

In this paper we introduce the idea of improving the performance of parametric temporal-difference (TD) learning algorithms by selectively emphasizing or de-emphasizing their updates on different time steps. In particular, we show that varying the emphasis of linear TD($λ$)'s updates in a particular way causes its expected update to become stable under off-policy training. The only prior model-free TD methods to achieve this with per-step computation linear in the number of function approximation parameters are the gradient-TD family of methods including TDC, GTD($λ$), and GQ($λ$). Compared to these methods, our _emphatic TD($λ$)_ is simpler and easier to use; it has only one learned parameter vector and one step-size parameter. Our treatment includes general state-dependent discounting and bootstrapping functions, and a way of specifying varying degrees of interest in accurately valuing different states.
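To make the abstract's description concrete, the following is a minimal sketch of one linear emphatic TD($λ$) step in plain Python. It assumes the update rules as they are commonly stated for this algorithm (a scalar followon trace, an emphasis computed from it, and an emphasis-weighted eligibility trace); the abstract itself does not give the equations. The class name `EmphaticTD` and all argument names are hypothetical choices for illustration, not the authors' code.

```python
class EmphaticTD:
    """Illustrative linear emphatic TD(lambda) learner (hypothetical names).

    It keeps a single learned weight vector (theta) and a single step size
    (alpha), matching the simplicity claim in the abstract.
    """

    def __init__(self, n_features, alpha):
        self.alpha = alpha                # the one step-size parameter
        self.theta = [0.0] * n_features   # the one learned parameter vector
        self.e = [0.0] * n_features       # eligibility trace
        self.F = 0.0                      # followon trace
        self.prev_rho = 0.0               # rho from the previous step; 0 makes F_0 = I_0

    def value(self, phi):
        """Linear value estimate theta^T phi."""
        return sum(w * x for w, x in zip(self.theta, phi))

    def update(self, phi, reward, phi_next, rho, gamma, gamma_next, lam, interest=1.0):
        """Apply one update for a transition from features phi to phi_next.

        rho        : importance-sampling ratio pi(a|s) / mu(a|s) for the action taken
        gamma      : state-dependent discount at the current state
        gamma_next : state-dependent discount at the next state
        lam        : state-dependent bootstrapping parameter at the current state
        interest   : degree of interest in valuing the current state accurately
        """
        # Followon trace: discounted, importance-weighted accumulation of interest.
        self.F = self.prev_rho * gamma * self.F + interest
        # Emphasis for this time step.
        M = lam * interest + (1.0 - lam) * self.F
        # Eligibility trace, with the current features scaled by the emphasis.
        self.e = [rho * (gamma * lam * e_i + M * x) for e_i, x in zip(self.e, phi)]
        # Ordinary TD error under linear function approximation.
        delta = reward + gamma_next * self.value(phi_next) - self.value(phi)
        # Single-vector, single-step-size update.
        self.theta = [w + self.alpha * delta * e_i for w, e_i in zip(self.theta, self.e)]
        self.prev_rho = rho
        return delta
```

A caller would construct one learner, for example `EmphaticTD(n_features=4, alpha=0.01)`, and call `update` once per transition generated by the behaviour policy. Note that there is no secondary weight vector or second step size to tune, which is the contrast with the gradient-TD family that the abstract draws.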

