LG, ML | Oct 23, 2018

Reconciling $λ$-Returns with Experience Replay

arXiv:1810.09967v37.53 citations | Has Code

Originality: Incremental advance

AI Analysis

This work addresses a specific bottleneck in replay-based reinforcement learning for researchers and practitioners, offering incremental improvements to sample efficiency and performance.

The paper tackles the challenge of integrating λ-returns into off-policy deep reinforcement learning methods that use experience replay. In these methods, random minibatch sampling makes λ-returns inefficient to compute. The proposed method promotes short sequences of past transitions into a small cache and precomputes their λ-returns, which enhances the performance of DQN on Atari 2600 games, even under partial observability.

Modern deep reinforcement learning methods have departed from the incremental learning required for eligibility traces, rendering the implementation of the $λ$-return difficult in this context. In particular, off-policy methods that utilize experience replay remain problematic because their random sampling of minibatches is not conducive to the efficient calculation of $λ$-returns. Yet replay-based methods are often the most sample efficient, and incorporating $λ$-returns into them is a viable way to achieve new state-of-the-art performance. Towards this, we propose the first method to enable practical use of $λ$-returns in arbitrary replay-based methods without relying on other forms of decorrelation such as asynchronous gradient updates. By promoting short sequences of past transitions into a small cache within the replay memory, adjacent $λ$-returns can be efficiently precomputed by sharing Q-values. Computation is not wasted on experiences that are never sampled, and stored $λ$-returns behave as stable temporal-difference (TD) targets that replace the target network. Additionally, our method grants the unique ability to observe TD errors prior to sampling; for the first time, transitions can be prioritized by their true significance rather than by a proxy to it. Furthermore, we propose the novel use of the TD error to dynamically select $λ$-values that facilitate faster learning. We show that these innovations can enhance the performance of DQN when playing Atari 2600 games, even under partial observability. While our work specifically focuses on $λ$-returns, these ideas are applicable to any multi-step return estimator.
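The precomputation step described in the abstract can be illustrated with a short sketch. It assumes the standard recursive form of the Q-learning $λ$-return, $R^λ_t = r_t + γ[(1-λ)\max_a Q(s_{t+1}, a) + λ R^λ_{t+1}]$, which the abstract itself does not spell out. The function name `lambda_returns`, its arguments, and the example numbers are hypothetical and are not taken from the paper's released code. The point is only that one backward pass over a cached block of adjacent transitions reuses each Q-value once, and that the TD errors are then available before any minibatch is sampled.

```python
def lambda_returns(rewards, next_max_q, dones, gamma=0.99, lam=0.8):
    """Backward recursion over one contiguous block of cached transitions.

    rewards[t]    : reward r_t
    next_max_q[t] : max_a Q(s_{t+1}, a), evaluated once for the whole block
    dones[t]      : True if s_{t+1} is terminal
    (All names are illustrative assumptions, not the paper's API.)
    """
    n = len(rewards)
    if n == 0:
        return []
    returns = [0.0] * n
    # Past the end of the block there is no stored return, so the last
    # transition falls back to a one-step bootstrap from its Q-value.
    next_return = next_max_q[-1]
    for t in reversed(range(n)):
        if dones[t]:
            # No bootstrapping across an episode boundary.
            returns[t] = rewards[t]
        else:
            returns[t] = rewards[t] + gamma * (
                (1.0 - lam) * next_max_q[t] + lam * next_return
            )
        next_return = returns[t]
    return returns


# Hypothetical block of three transitions ending in a terminal state.
rewards = [1.0, 1.0, 1.0]
next_max_q = [0.5, 0.4, 0.0]
dones = [False, False, True]
targets = lambda_returns(rewards, next_max_q, dones, gamma=0.9, lam=0.8)

# With stored targets, TD errors can be inspected before sampling.
q_taken = [2.0, 1.5, 0.9]  # assumed Q(s_t, a_t) for the cached transitions
td_errors = [g - q for g, q in zip(targets, q_taken)]
```

Setting `lam=0` reduces each target to the one-step return, while `lam=1` with a terminal final transition gives the discounted Monte Carlo return, which is a quick sanity check on the recursion.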
