LG AI ML

Feb 26, 2020

Policy Evaluation Networks

Jean Harb, Tom Schaul, Doina Precup, Pierre-Luc Bacon

arXiv:2002.11833v1

17.941 citations

Originality: Highly original

AI Analysis

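This work addresses a bottleneck in reinforcement learning: improving a policy normally requires collecting new data. By learning to predict the value of a policy directly, it lets researchers optimize policies without further interaction with the environment.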

The paper tackles the problem of estimating the values of many policies on a fixed set of states. This enables zero-shot policy improvement without new data, and the paper demonstrates that the resulting policies can outperform the policies that generated the training data.

Many reinforcement learning algorithms use value functions to guide the search for better policies. These methods estimate the value of a single policy while generalizing across many states. The core idea of this paper is to flip this convention and estimate the value of many policies, for a single set of states. This approach opens up the possibility of performing direct gradient ascent in policy space without seeing any new data. The main challenge for this approach is finding a way to represent complex policies that facilitates learning and generalization. To address this problem, we introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding. Our empirical results demonstrate that combining these three elements (a learned Policy Evaluation Network, policy fingerprints, and gradient ascent) can produce policies that outperform those that generated the training data, in a zero-shot manner.
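The three elements named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes linear policies, random (not learned) probing states as the fingerprint inputs, a ridge-regressed quadratic model standing in for the Policy Evaluation Network, and a made-up quadratic return function. All names (`probe_states`, `fingerprint`, `true_return`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_PROBE, N_POLICIES = 2, 3, 200

# Assumption: probing states are fixed at random here, not learned.
probe_states = rng.normal(size=(N_PROBE, DIM))

def fingerprint(theta):
    """Actions of the linear policy a = theta . s on each probing state."""
    return probe_states @ theta

def true_return(theta):
    """Toy stand-in for the environment; used only to label and check."""
    return -np.sum((theta - np.array([1.0, -0.5])) ** 2)

def features(f):
    """Quadratic features of a fingerprint: [1, f, f_i * f_j for i <= j]."""
    return np.concatenate(([1.0], f, np.outer(f, f)[np.triu_indices(len(f))]))

# Offline dataset: random policies and their returns.
thetas = rng.normal(size=(N_POLICIES, DIM))
returns = np.array([true_return(t) for t in thetas])

# Fit the stand-in evaluation model by ridge regression on fingerprints.
X = np.stack([features(fingerprint(t)) for t in thetas])
weights = np.linalg.solve(X.T @ X + 1e-6 * np.eye(X.shape[1]), X.T @ returns)

def predicted_return(theta):
    return features(fingerprint(theta)) @ weights

def predicted_return_grad(theta):
    """Gradient of the predicted return w.r.t. policy parameters (chain rule)."""
    f = fingerprint(theta)
    k = len(f)
    Q = np.zeros((k, k))
    Q[np.triu_indices(k)] = weights[1 + k:]
    dV_df = weights[1:1 + k] + (Q + Q.T) @ f   # d(prediction) / d(fingerprint)
    return probe_states.T @ dV_df              # d(fingerprint) / d(theta)

# Gradient ascent in policy space, using only the learned model (no new data).
theta = thetas[np.argmax(returns)].copy()      # start from the best dataset policy
for _ in range(100):
    theta += 0.1 * predicted_return_grad(theta)

improved_return = true_return(theta)
print("best return in dataset:", returns.max())
print("return after ascent:   ", improved_return)
```

In this toy setting the model can represent the return exactly, so ascent on the prediction also improves the true return. With a learned network and a real environment, that agreement is an empirical question, which is what the paper's experiments examine.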

