LG | AI | ML | Jun 8, 2016

Safe and Efficient Off-Policy Reinforcement Learning

arXiv:1606.02647v2 | 688 citations
Originality: Highly original
AI Analysis

This work gives reinforcement learning practitioners a safer and more sample-efficient way to learn off-policy from return-based updates, with convergence results in both the policy evaluation and control settings.

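The paper addresses off-policy, return-based reinforcement learning by introducing Retrace(λ), an algorithm that has low variance, safely uses samples from any behaviour policy, and makes efficient use of samples from near on-policy behaviour. The authors analyze the associated operator as a contraction, state that they believe it is the first return-based off-policy control algorithm to converge almost surely to Q* without the GLIE assumption, and, as a corollary, prove the convergence of Watkins' Q(λ), an open problem since 1989.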

In this work, we take a fresh look at some old and new algorithms for off-policy, return-based reinforcement learning. Expressing these in a common form, we derive a novel algorithm, Retrace($λ$), with three desired properties: (1) it has low variance; (2) it safely uses samples collected from any behaviour policy, whatever its degree of "off-policyness"; and (3) it is efficient as it makes the best use of samples collected from near on-policy behaviour policies. We analyze the contractive nature of the related operator under both off-policy policy evaluation and control settings and derive online sample-based algorithms. We believe this is the first return-based off-policy control algorithm converging a.s. to $Q^*$ without the GLIE assumption (Greedy in the Limit with Infinite Exploration). As a corollary, we prove the convergence of Watkins' Q($λ$), which was an open problem since 1989. We illustrate the benefits of Retrace($λ$) on a standard suite of Atari 2600 games.
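To make the three properties concrete, here is a minimal sketch of how Retrace($λ$) targets could be computed for a single recorded trajectory. It assumes the truncated importance weight from the full paper, $c_s = λ \min(1, π(a_s|x_s)/μ(a_s|x_s))$, which is not stated on this page; the function name, argument names, and array layout are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def retrace_targets(q, rewards, actions, pi, mu, gamma=0.99, lam=1.0):
    """Sketch of Retrace(lambda) targets for one trajectory (hypothetical API).

    q       : (T+1, A) current Q estimates for states x_0..x_T
              (use zeros in the last row if x_T is terminal)
    rewards : (T,)     rewards r_0..r_{T-1}
    actions : (T,)     integer actions a_0..a_{T-1} taken by the behaviour policy
    pi      : (T+1, A) target-policy probabilities in each state
    mu      : (T,)     behaviour-policy probability of each taken action
    """
    q = np.asarray(q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    actions = np.asarray(actions)
    mu = np.asarray(mu, dtype=float)

    T = len(rewards)
    idx = np.arange(T)
    q_taken = q[idx, actions]                 # Q(x_t, a_t)
    exp_q = (pi * q).sum(axis=1)              # E_pi Q(x_t, .) for t = 0..T

    # One-step expected TD errors under the target policy.
    deltas = rewards + gamma * exp_q[1:] - q_taken

    # Assumed trace coefficient: truncated importance ratio scaled by lambda.
    # Truncation at 1 keeps variance low; a ratio of 0 cuts the trace.
    c = lam * np.minimum(1.0, pi[idx, actions] / mu)

    # Backward recursion: G_t = delta_t + gamma * c_{t+1} * G_{t+1}.
    corrections = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        next_c = c[t + 1] if t + 1 < T else 0.0
        acc = deltas[t] + gamma * next_c * acc
        corrections[t] = acc

    return q_taken + corrections
```

Under these assumptions, setting `lam=0` reduces each target to the one-step expected backup, while trajectories where `pi` and `mu` agree on the taken actions keep the full trace (scaled only by `lam`), which is the "efficient near on-policy" behaviour the abstract describes.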

Code Implementations: 3 repos
