LG Oct 15, 2021

Wasserstein Unsupervised Reinforcement Learning

Shuncheng He, Yuhang Jiang, Hongchang Zhang, Jianzhun Shao, Xiangyang Ji

arXiv:2110.07940v114.632 citations

Originality: Incremental advance

AI Analysis

This work addresses a key bottleneck in unsupervised skill discovery for reinforcement learning agents, enabling better exploration and downstream task performance, though it is incremental as it builds on existing mutual information approaches.

The paper tackles insufficient state space exploration in unsupervised reinforcement learning. It proposes Wasserstein Unsupervised Reinforcement Learning (WURL), which directly maximizes the distance between the state distributions of different policies. The resulting policies outperform mutual information-based methods on Wasserstein distance metrics while remaining highly discriminable.

Unsupervised reinforcement learning aims to train agents to learn a handful of policies or skills in environments without external reward. These pre-trained policies can accelerate learning once external reward is provided, and can also be used as primitive options in hierarchical reinforcement learning. Conventional approaches to unsupervised skill discovery feed a latent variable to the agent and impose its empowerment on the agent's behavior via mutual information (MI) maximization. However, the policies learned by MI-based methods cannot sufficiently explore the state space, even though they can be successfully distinguished from each other. We therefore propose a new framework, Wasserstein unsupervised reinforcement learning (WURL), in which we directly maximize the distance between the state distributions induced by different policies. Additionally, we overcome the difficulties of simultaneously training N (N > 2) policies and of amortizing the overall reward to each step. Experiments show that policies learned by our approach outperform MI-based methods on the metric of Wasserstein distance while keeping high discriminability. Furthermore, agents trained by WURL can sufficiently explore the state space in mazes and MuJoCo tasks, and the pre-trained policies can be applied to downstream tasks by hierarchical learning.
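To make the idea of "distance between state distributions" concrete, here is a minimal sketch, not the paper's implementation. It assumes one-dimensional states and equal-size sample sets, in which case the empirical 1-Wasserstein distance reduces to the mean absolute difference between sorted samples. The function names, the toy data, and the choice to score each policy by its distance to the nearest other policy are illustrative assumptions rather than details confirmed by the abstract.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and the same size")
    # In 1-D, the optimal transport plan matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def diversity_scores(state_samples):
    """Score each policy by its distance to the nearest other policy.

    Taking the minimum over the other policies is an assumption made for
    this sketch; it is one simple way to extend a pairwise distance to
    more than two policies.
    """
    scores = []
    for i, own in enumerate(state_samples):
        others = (s for j, s in enumerate(state_samples) if j != i)
        scores.append(min(wasserstein_1d(own, other) for other in others))
    return scores


# Hypothetical 1-D states visited by three policies.
samples = [
    [0.0, 1.0, 2.0],
    [0.5, 1.5, 2.5],
    [10.0, 11.0, 12.0],
]
print(diversity_scores(samples))
```

In this toy data the first two policies visit nearly the same states and so receive low scores, while the third visits a distant region and scores high. A reward built this way favors spreading out in state space rather than merely being distinguishable.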
