LG | ML | Apr 17, 2017

O$^2$TD: (Near)-Optimal Off-Policy TD Learning

arXiv:1704.05147v2 | 1 citation
Originality: Highly original
AI Analysis

This paper addresses the suboptimal objective functions of widely used temporal difference (TD) methods in reinforcement learning, offering improved efficiency and stability in off-policy settings.
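
For reference, the objectives usually at issue in this setting can be written as follows. This is standard textbook notation, not notation taken from the paper: $V_\theta$ is the parameterized estimate, $T$ is the Bellman operator of the target policy, $\Pi$ is the projection onto the function class, and $\|\cdot\|_D$ is a norm weighted by a state distribution $D$. All of these symbols are assumptions introduced here for illustration.

$$
\begin{aligned}
\mathrm{MSE}(\theta) &= \|V_\theta - V\|_D^2, \\
\mathrm{MSBE}(\theta) &= \|V_\theta - T V_\theta\|_D^2, \\
\mathrm{MSPBE}(\theta) &= \|V_\theta - \Pi T V_\theta\|_D^2.
\end{aligned}
$$

Residual Gradient methods descend the MSBE, while the TD fixed point under linear function approximation is a zero of the MSPBE. Neither coincides in general with the minimizer of the MSE, which is the quantity that measures how well $V_\theta$ approximates $V$.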

The paper tackles the problem of approximating the true value function $V$ in off-policy temporal difference learning. It proposes two novel algorithms: a batch algorithm that finds an approximately optimal off-policy prediction, and a near-optimal algorithm with linear computational cost per step.

Temporal difference learning and Residual Gradient methods are the most widely used temporal-difference-based learning algorithms; however, it has been shown that neither of their objective functions is optimal with respect to approximating the true value function $V$. Two novel algorithms are proposed to approximate the true value function $V$. This paper makes the following contributions: (1) a batch algorithm that helps find the approximately optimal off-policy prediction of the true value function $V$; (2) a near-optimal algorithm with linear per-step computational cost that can learn from a collection of off-policy samples; (3) a new perspective on emphatic temporal difference learning that bridges the gap between off-policy optimality and off-policy stability.
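
To make the two baseline methods named above concrete, here is a minimal sketch of a single off-policy update for each, using linear features and an importance sampling ratio. It is not the O$^2$TD algorithm from the paper. The function names, the ratio `rho`, and the default step size and discount are illustrative assumptions.

```python
import numpy as np

def td0_update(theta, phi, phi_next, reward, rho, gamma=0.9, alpha=0.1):
    """One off-policy TD(0) step with linear features (hypothetical helper).

    rho is the importance sampling ratio pi(a|s) / mu(a|s), assumed given.
    The update costs O(d) for a d-dimensional feature vector.
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    return theta + alpha * rho * delta * phi

def residual_gradient_update(theta, phi, phi_next, reward, rho,
                             gamma=0.9, alpha=0.1):
    """One residual-gradient step on the squared TD error (hypothetical helper).

    Using a single sampled next state makes this gradient estimate biased
    when transitions are stochastic (the double-sampling issue).
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    return theta + alpha * rho * delta * (phi - gamma * phi_next)

# Illustrative transition with two one-hot features (made-up numbers).
theta0 = np.zeros(2)
phi = np.array([1.0, 0.0])
phi_next = np.array([0.0, 1.0])
theta_td = td0_update(theta0, phi, phi_next, reward=1.0, rho=2.0)
theta_rg = residual_gradient_update(theta0, phi, phi_next, reward=1.0, rho=2.0)
```

The two updates differ only in the direction they move along: TD(0) uses the current features alone, while the residual-gradient step also pushes against the next state's features. Both run in time linear in the number of features, which is the per-step budget the abstract refers to.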
