LG, MA, ML · Jul 24, 2019

Terminal Prediction as an Auxiliary Task for Deep Reinforcement Learning

arXiv:1907.10827v1 · 31 citations
Originality: Incremental advance
AI Analysis

This work targets sample inefficiency and convergence to locally optimal policies in deep reinforcement learning for episodic tasks. Its contribution, a new self-supervised auxiliary task, is an incremental advance.

The paper tackles convergence to locally optimal policies and sample inefficiency in deep reinforcement learning. It introduces Terminal Prediction (TP), a self-supervised auxiliary task in which the agent estimates its temporal closeness to a terminal state. On a set of Atari games and the BipedalWalker domain, A3C-TP outperforms standard A3C in most tested domains and performs similarly in the rest. In Pommerman, it significantly improves both learning efficiency and the quality of the policies it converges to.

Deep reinforcement learning has achieved great successes in recent years, but there are still open challenges, such as convergence to locally optimal policies and sample inefficiency. In this paper, we contribute a novel self-supervised auxiliary task, i.e., Terminal Prediction (TP), estimating temporal closeness to terminal states for episodic tasks. The intuition is to help representation learning by letting the agent predict how close it is to a terminal state while learning its control policy. Although TP could be integrated with multiple algorithms, this paper focuses on Asynchronous Advantage Actor-Critic (A3C) and demonstrates the advantages of A3C-TP. Our extensive evaluation includes a set of Atari games, the BipedalWalker domain, and a mini version of the recently proposed multi-agent Pommerman game. Our results on Atari games and the BipedalWalker domain suggest that A3C-TP outperforms standard A3C in most of the tested domains and has similar performance in the others. In Pommerman, our proposed method provides significant improvement both in learning efficiency and in converging to better policies against different opponents.
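To make "temporal closeness to terminal states" concrete, here is a minimal Python sketch of how such an auxiliary target and loss could be set up. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the target is taken to be the fraction of the episode elapsed at each step, the loss is a mean squared error, and the function names and the `tp_weight` default are hypothetical.

```python
from typing import List, Sequence


def terminal_prediction_targets(episode_length: int) -> List[float]:
    """Assumed target: fraction of the episode elapsed after each step.

    For an episode of T steps, step t (1-based) gets the target t / T,
    so the final step before termination gets 1.0.
    """
    if episode_length <= 0:
        raise ValueError("episode_length must be positive")
    return [(t + 1) / episode_length for t in range(episode_length)]


def terminal_prediction_loss(
    predictions: Sequence[float], targets: Sequence[float]
) -> float:
    """Mean squared error between predicted and target closeness (assumed loss)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("at least one step is required")
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return sum(squared_errors) / len(targets)


def combined_loss(a3c_loss: float, tp_loss: float, tp_weight: float = 0.5) -> float:
    """Add the auxiliary loss to the base actor-critic loss.

    tp_weight is a hypothetical weighting coefficient, not a value
    taken from the abstract.
    """
    return a3c_loss + tp_weight * tp_loss


if __name__ == "__main__":
    targets = terminal_prediction_targets(4)
    print(targets)
    # A head that always predicts 0.5 regardless of state:
    print(terminal_prediction_loss([0.5, 0.5, 0.5, 0.5], targets))
```

Because the targets are only known once an episode ends, a sketch like this would compute them retrospectively over a finished trajectory. How to label steps from episodes that are still running is left open here.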
