LG · ML · Jun 21, 2021

Emphatic Algorithms for Deep Reinforcement Learning

arXiv:2106.11779v1 · 23 citations
Originality: Incremental advance
AI Analysis

This paper addresses the 'deadly triad' problem faced by reinforcement learning practitioners, offering incremental improvements in stability and performance.

The paper tackled the instability of temporal difference learning in off-policy deep reinforcement learning by extending emphatic methods to deep agents, deriving new algorithms that showed noticeable benefits in small problems and improved performance on Atari games.

Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling; this is known as the "deadly triad". The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observe improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
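
To make the phrase "appropriately weighting the TD($\lambda$) updates" concrete, here is a minimal sketch of one linear ETD($\lambda$) step as it is commonly stated in the emphatic TD literature. It is not one of the new deep-RL algorithms derived in this paper. The function name `etd_lambda_step`, the argument names, and the constant discount, trace decay, and interest values are assumptions chosen for illustration.

```python
def etd_lambda_step(theta, trace, followon, rho_prev, rho, phi, phi_next,
                    reward, gamma=0.9, lam=0.0, interest=1.0, alpha=0.1):
    """One linear ETD(lambda) update (illustrative sketch).

    theta, trace, phi, phi_next are equal-length lists of floats.
    rho_prev and rho are the importance-sampling ratios pi/mu for the
    previous and the current step. Returns (theta, trace, followon).
    """
    # Follow-on trace: discounted, importance-weighted accumulation of interest.
    followon = rho_prev * gamma * followon + interest
    # Emphasis: interpolates between the interest and the follow-on trace.
    emphasis = lam * interest + (1.0 - lam) * followon
    # Eligibility trace, scaled by the emphasis and the current ratio.
    trace = [rho * (gamma * lam * e + emphasis * x)
             for e, x in zip(trace, phi)]
    # TD error under the current linear value estimate.
    value = sum(t * x for t, x in zip(theta, phi))
    value_next = sum(t * x for t, x in zip(theta, phi_next))
    delta = reward + gamma * value_next - value
    # Emphatically weighted TD update of the weights.
    theta = [t + alpha * delta * e for t, e in zip(theta, trace)]
    return theta, trace, followon


# Hypothetical two-state example with one-hot features, on-policy ratios.
theta, trace, followon = [0.0, 0.0], [0.0, 0.0], 0.0
theta, trace, followon = etd_lambda_step(
    theta, trace, followon, rho_prev=1.0, rho=1.0,
    phi=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0)
```

The emphasis term is what distinguishes this from plain off-policy TD($\lambda$): setting `emphasis` to 1 on every step recovers the usual importance-weighted trace update.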
