LG, AI | Dec 11, 2020

OPAC: Opportunistic Actor-Critic

Srinjoy Roy, Saptam Bakshi, Tamal Maharaj

arXiv:2012.06555v14.23 citations | h-index: 1

Originality: Incremental advance

AI Analysis

This work provides an incremental improvement for reinforcement learning practitioners working on continuous control tasks by offering a new algorithm that combines features of existing methods.

This paper introduces Opportunistic Actor-Critic (OPAC), a novel model-free deep reinforcement learning algorithm that addresses inefficient exploration and sub-optimal policies in existing actor-critic methods. OPAC achieves state-of-the-art performance on MuJoCo environments, outperforming or at least equaling TD3 and SAC.

Actor-critic methods, a type of model-free reinforcement learning (RL), have achieved state-of-the-art performance in many real-world continuous control domains. Despite this success, wide-scale deployment of these models remains a distant goal. The main problems with these actor-critic methods are inefficient exploration and sub-optimal policies. Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3), two cutting-edge algorithms of this kind, suffer from these issues. SAC effectively addressed sample complexity and the brittleness of convergence with respect to hyper-parameters, and thus outperformed all other state-of-the-art algorithms, including TD3, on harder tasks, whereas TD3 produced moderate results in all environments. However, SAC suffers from inefficient exploration owing to the Gaussian nature of its policy, which causes borderline performance on simpler tasks. In this paper, we introduce Opportunistic Actor-Critic (OPAC), a novel model-free deep RL algorithm that employs a better exploration policy and has lower variance. OPAC combines some of the most powerful features of TD3 and SAC and aims to optimize a stochastic policy in an off-policy way. To calculate the target Q-values, OPAC uses three critics instead of two and, based on the complexity of the environment, opportunistically chooses how the target Q-value is computed from the critics' evaluations. We have systematically evaluated the algorithm on MuJoCo environments, where it achieves state-of-the-art performance and outperforms or at least equals TD3 and SAC.
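
The three-critic target computation described in the abstract can be illustrated with a minimal sketch. The abstract does not say which aggregation rules OPAC chooses between, so the `"min"`, `"median"`, and `"mean"` modes below are assumptions made for illustration, as are the function names, the SAC-style entropy term, and the default values of `gamma` and `alpha`. This is not the authors' implementation.

```python
import numpy as np


def aggregate_critics(q_values, mode):
    """Combine three critics' estimates into one value per sample.

    q_values: array-like of shape (3, batch), one row per critic.
    mode: hypothetical aggregation rule; the abstract does not name the
          actual rules, so these three are illustrative assumptions.
    """
    q = np.asarray(q_values, dtype=float)
    if q.shape[0] != 3:
        raise ValueError("expected estimates from exactly three critics")
    if mode == "min":      # most pessimistic (clipped double-Q style)
        return q.min(axis=0)
    if mode == "median":   # middle estimate of the three
        return np.median(q, axis=0)
    if mode == "mean":     # average of the three
        return q.mean(axis=0)
    raise ValueError(f"unknown aggregation mode: {mode!r}")


def soft_td_target(reward, done, next_q_values, next_log_prob, mode,
                   gamma=0.99, alpha=0.2):
    """SAC-style soft TD target (assumed form) using an aggregated critic value.

    target = r + gamma * (1 - done) * (agg(Q1, Q2, Q3) - alpha * log_pi)
    """
    reward = np.asarray(reward, dtype=float)
    done = np.asarray(done, dtype=float)
    next_log_prob = np.asarray(next_log_prob, dtype=float)
    next_q = aggregate_critics(next_q_values, mode)
    soft_value = next_q - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value
```

In this sketch the `mode` argument stands in for the opportunistic choice: a caller would pick it per environment, for example a more pessimistic rule where overestimation is a concern. How OPAC actually maps environment complexity to a rule is not stated in the abstract.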

