LG · Jun 2, 2023

ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

Andrew Jesson, Chris Lu, Gunshi Gupta, Nicolas Beltran-Velez, Angelos Filos, Jakob Nicolaus Foerster, Yarin Gal

arXiv:2306.01460v412.311 citations · h-index: 64 · Has Code

Originality: Incremental advance

AI Analysis

This work targets sample efficiency and stability in deep reinforcement learning, for both researchers and practitioners. The contribution is incremental: it modifies the existing A3C algorithm rather than introducing a new one.

The paper aims to improve on-policy actor-critic deep reinforcement learning through three modifications to the A3C algorithm: applying a ReLU to the advantage estimates, spectrally normalizing the weights, and using dropout as a Bayesian approximation. It reports significant improvements in median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark, and over PPO on the ProcGen generalization benchmark.
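The first modification, restricting policy updates to positive advantages, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `positive_advantage_policy_loss` is hypothetical, and the advantage is assumed here to be a plain return-minus-value estimate, which may differ from the estimator used in the paper.

```python
def relu(x):
    """Rectified linear unit for a single float."""
    return x if x > 0.0 else 0.0


def positive_advantage_policy_loss(log_probs, returns, values):
    """Policy-gradient surrogate loss that keeps only positive advantages.

    log_probs: log pi(a_t | s_t) for the sampled actions
    returns:   return estimates for each step
    values:    critic estimates V(s_t) for each step

    Assumption: advantage = return - value. In an autodiff framework the
    advantages would be treated as constants (no gradient through the critic).
    """
    if not (len(log_probs) == len(returns) == len(values)):
        raise ValueError("inputs must have the same length")
    advantages = [g - v for g, v in zip(returns, values)]
    # Steps with a negative advantage contribute nothing to the update.
    positive = [relu(a) for a in advantages]
    weighted = sum(lp * a for lp, a in zip(log_probs, positive))
    return -weighted / len(log_probs)


if __name__ == "__main__":
    loss = positive_advantage_policy_loss(
        log_probs=[-0.5, -1.0, -2.0],
        returns=[1.0, 0.0, 3.0],
        values=[0.5, 1.0, 1.0],
    )
    print(loss)
```

In this toy batch the second step has advantage -1.0, so it is zeroed out and only the first and third steps influence the loss.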

This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.
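The link between spectral normalization and the Lipschitz constant can be made concrete: for a network with 1-Lipschitz activations, the product of the layers' largest singular values upper-bounds its Lipschitz constant, so rescaling each weight matrix to spectral norm 1 caps that bound. The sketch below estimates the largest singular value by power iteration. It is illustrative only; the helper names (`spectral_norm`, `spectrally_normalize`) and the fixed all-ones start vector are assumptions, not taken from the paper's code.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*W)]


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration.

    Assumption: the all-ones start vector is not orthogonal to the top
    right-singular vector; practical implementations use a random start.
    """
    Wt = transpose(W)
    v = [1.0] * len(W[0])
    for _ in range(n_iters):
        u = matvec(W, v)
        u_norm = l2_norm(u)
        if u_norm == 0.0:
            return 0.0
        u = [x / u_norm for x in u]
        v = matvec(Wt, u)
        v_norm = l2_norm(v)
        if v_norm == 0.0:
            return 0.0
        v = [x / v_norm for x in v]
    return l2_norm(matvec(W, v))


def spectrally_normalize(W, n_iters=50):
    """Return W divided by its estimated largest singular value."""
    sigma = spectral_norm(W, n_iters)
    if sigma == 0.0:
        return [row[:] for row in W]
    return [[w / sigma for w in row] for row in W]


if __name__ == "__main__":
    W = [[3.0, 0.0], [0.0, 1.0]]
    print(spectral_norm(W))
    print(spectrally_normalize(W))
```

Deep learning libraries typically run only one power-iteration step per training update and carry the vector estimate across updates; the 50-step loop here trades efficiency for a self-contained demonstration.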
