LG Jul 15, 2022

The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning

Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney, Marc G. Bellemare

arXiv:2207.07570v1 10.412 citations h-index: 88

Originality: Highly original

AI Analysis

This work addresses fundamental challenges in distributional RL, offering AI researchers both theoretical insights and practical algorithmic improvements.

The paper tackles the problem of multi-step off-policy learning in distributional reinforcement learning by identifying a novel path-dependent distributional TD error, providing the first theoretical guarantees for such algorithms and deriving Quantile Regression-Retrace, which improves QR-DQN on the Atari-57 benchmark.

We study the multi-step off-policy learning approach to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The distinction from the value-based case bears important implications on concepts such as backward-view algorithms. Our work provides the first theoretical guarantees on multi-step off-policy distributional RL algorithms, including results that apply to the small number of existing approaches to multi-step distributional RL. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent QR-DQN-Retrace that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how unique challenges in multi-step distributional RL can be addressed both in theory and practice.
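For orientation, here is a minimal sketch of the value-based Retrace target, the expected-value setting the abstract contrasts with. It is not the paper's Quantile Regression-Retrace. It uses the standard Retrace trace coefficients c_t = lambda * min(1, pi(a_t|x_t) / mu(a_t|x_t)). The function name `retrace_target` and its argument layout are assumptions made for illustration. In this scalar case the multi-step target is a discounted, trace-weighted sum of one-step TD errors. Per the abstract, the distributional analogue instead needs path-dependent TD errors, so this additive form does not carry over unchanged.

```python
import numpy as np

def retrace_target(q_taken, rewards, expected_next_q, pi_probs, mu_probs,
                   gamma=0.99, lam=1.0):
    """Value-based Retrace target for (x_0, a_0) along one trajectory.

    Hypothetical argument layout (illustrative, not from the paper):
      q_taken[t]         = Q(x_t, a_t)
      rewards[t]         = r_t
      expected_next_q[t] = sum_a pi(a | x_{t+1}) * Q(x_{t+1}, a)
      pi_probs[t]        = pi(a_t | x_t)   (target policy)
      mu_probs[t]        = mu(a_t | x_t)   (behaviour policy, must be > 0)
    """
    q_taken = np.asarray(q_taken, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    expected_next_q = np.asarray(expected_next_q, dtype=float)
    pi_probs = np.asarray(pi_probs, dtype=float)
    mu_probs = np.asarray(mu_probs, dtype=float)

    # Truncated importance weights: c_t = lam * min(1, pi / mu).
    c = lam * np.minimum(1.0, pi_probs / mu_probs)

    # One-step TD errors: delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t).
    td = rewards + gamma * expected_next_q - q_taken

    # Cumulative traces c_1 * ... * c_t; the empty product at t = 0 is 1.
    trace = np.concatenate(([1.0], np.cumprod(c[1:])))

    # Discount factors gamma^t.
    discounts = gamma ** np.arange(len(td))

    # Scalar case: the target is Q(x_0, a_0) plus a weighted sum of TD errors.
    return q_taken[0] + np.sum(discounts * trace * td)
```

When the target and behaviour policies agree and `lam=1.0`, every trace equals 1 and the target is the initial estimate plus the full discounted sum of TD errors. When a later action has zero probability under the target policy, the trace is cut at that step and only the earlier TD errors contribute.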
