LG, AI, ML | Sep 25, 2019

Off-Policy Actor-Critic with Shared Experience Replay

arXiv:1909.11583v2 | 71 citations
Originality: Incremental advance
AI Analysis

This work addresses stability and efficiency issues in reinforcement learning for researchers and practitioners, though it appears incremental as it builds on existing methods like V-trace.

The paper tackles the challenges of combining actor-critic reinforcement learning with large-scale experience replay, proposing solutions for efficient learning and stability in off-policy settings, and demonstrates state-of-the-art data efficiency on Atari among agents trained for up to 200M environment frames.

We investigate the combination of actor-critic reinforcement learning algorithms with uniform large-scale experience replay and propose solutions for two challenges: (a) efficient actor-critic learning with experience replay and (b) stability of off-policy learning where agents learn from other agents' behaviour. We employ those insights to accelerate hyper-parameter sweeps in which all participating agents run concurrently and share their experience via a common replay module. To this end we analyze the bias-variance tradeoffs in V-trace, a form of importance sampling for actor-critic methods. Based on our analysis, we then argue for mixing experience sampled from replay with on-policy experience, and propose a new trust region scheme that scales effectively to data distributions where V-trace becomes unstable. We provide extensive empirical validation of the proposed solution. We further show the benefits of this setup by demonstrating state-of-the-art data efficiency on Atari among agents trained up until 200M environment frames.
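The abstract names V-trace but does not reproduce its definition. As a rough illustration only, the sketch below computes V-trace value targets for a single trajectory, assuming the standard formulation with truncated importance weights from the IMPALA paper (Espeholt et al., 2018). The function name `vtrace_targets`, its parameters, and the default thresholds are hypothetical choices for this example, not taken from the paper summarized here, and the sketch does not cover the paper's replay mixing or trust region scheme.

```python
import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Hypothetical helper: V-trace targets for one trajectory of length T.

    behaviour_logp / target_logp: log-probabilities of the taken actions
    under the behaviour policy (which generated the data) and the target
    policy (which is being learned). All array inputs have shape [T].
    """
    values = np.asarray(values, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Importance ratios pi(a|x) / mu(a|x), truncated from above.
    ratios = np.exp(np.asarray(target_logp) - np.asarray(behaviour_logp))
    rhos = np.minimum(rho_bar, ratios)  # weights the TD errors
    cs = np.minimum(c_bar, ratios)      # controls how far corrections propagate

    # Truncated-importance-weighted one-step TD errors.
    values_next = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * values_next - values)

    # Backward recursion: corr[t] = delta[t] + gamma * c[t] * corr[t + 1].
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc

    return values + corrections
```

When the behaviour and target policies coincide, every ratio equals 1 and the targets reduce to ordinary discounted n-step returns. The truncation thresholds are where the bias-variance tradeoff mentioned in the abstract arises: lower thresholds reduce variance but bias the targets toward the behaviour policy.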

