LG ML | Jul 4, 2022

General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States

Francesco Faccio, Aditya Ramesh, Vincent Herrmann, Jean Harb, Jürgen Schmidhuber

arXiv:2207.01566v1 | 12.412 citations | h-index: 100 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of policy evaluation and improvement in RL, offering a value function that is invariant to changes in policy architecture. The contribution appears incremental, since it builds on existing actor-critic and policy-embedding techniques.

The paper tackles the problem of evaluating and improving policies in reinforcement learning by learning a single value function that works for many policies, using a small set of learned 'probing states' to predict returns. It achieves competitive results, such as cloning near-optimal policies in Swimmer-v3 and Hopper-v3 with only 3 and 5 states, respectively, and enables zero-shot learning of linear policies.

Learning to evaluate and improve policies is a core problem of Reinforcement Learning (RL). Traditional RL algorithms learn a value function defined for a single policy. A recently explored competitive alternative is to learn a single value function for many policies. Here we combine the actor-critic architecture of Parameter-Based Value Functions and the policy embedding of Policy Evaluation Networks to learn a single value function for evaluating (and thus helping to improve) any policy represented by a deep neural network (NN). The method yields competitive experimental results. In continuous control problems with infinitely many states, our value function minimizes its prediction error by simultaneously learning a small set of 'probing states' and a mapping from actions produced in probing states to the policy's return. The method extracts crucial abstract knowledge about the environment in the form of very few states sufficient to fully specify the behavior of many policies. A policy improves solely by changing actions in probing states, following the gradient of the value function's predictions. Surprisingly, it is possible to clone the behavior of a near-optimal policy in Swimmer-v3 and Hopper-v3 environments only by knowing how to act in 3 and 5 such learned states, respectively. Remarkably, our value function trained to evaluate NN policies is also invariant to changes of the policy architecture: we show that it allows for zero-shot learning of linear policies competitive with the best policy seen during training. Our code is public.
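The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tanh-linear policy, the linear value head (the abstract only says "a mapping"), the plain squared-error gradient steps, and every name below (`probing_states`, `fingerprint`, `critic_step`, `policy_step`) are assumptions made for illustration. The sketch shows the three ingredients: a policy is summarized by its actions in a few learned probing states, the value function maps those actions to a predicted return, and the policy improves by following the gradient of that prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, num_probes = 4, 2, 3  # toy sizes, chosen arbitrarily

# Learnable critic parameters (hypothetical names): probing states and a linear head.
probing_states = rng.normal(size=(num_probes, state_dim))
head_w = rng.normal(size=num_probes * action_dim)
head_b = 0.0


def fingerprint(policy_w, probes):
    """Concatenate the policy's actions in every probing state."""
    return np.tanh(probes @ policy_w.T).ravel()


def predict_return(policy_w, probes, w, b):
    """Predicted return of a policy, computed only from its probing-state actions."""
    return float(fingerprint(policy_w, probes) @ w + b)


def critic_step(policy_w, observed_return, probes, w, b, lr=1e-3):
    """One gradient step on the squared prediction error.

    Both the head (w, b) and the probing states themselves are updated.
    """
    acts = np.tanh(probes @ policy_w.T)              # shape (K, A)
    err = acts.ravel() @ w + b - observed_return
    dpre = w.reshape(acts.shape) * (1.0 - acts**2)   # d(prediction)/d(pre-activation)
    grad_probes = (2.0 * err) * (dpre @ policy_w)    # shape (K, state_dim)
    grad_w = (2.0 * err) * acts.ravel()
    grad_b = 2.0 * err
    return probes - lr * grad_probes, w - lr * grad_w, b - lr * grad_b


def policy_step(policy_w, probes, w, lr=1e-3):
    """Gradient ascent on the predicted return with respect to the policy weights."""
    acts = np.tanh(probes @ policy_w.T)
    dpre = w.reshape(acts.shape) * (1.0 - acts**2)
    grad_policy = dpre.T @ probes                    # shape (A, state_dim)
    return policy_w + lr * grad_policy


# Toy usage: a random policy and a made-up observed return of 1.0.
policy_w = rng.normal(size=(action_dim, state_dim))
new_probes, new_w, new_b = critic_step(policy_w, 1.0, probing_states, head_w, head_b)
improved_policy_w = policy_step(policy_w, probing_states, head_w)
```

In this sketch the policy parameters enter the value function only through `fingerprint`, so any policy class that maps the probing states to actions of the same dimension could be evaluated by the same head, which is the property the abstract relies on for zero-shot evaluation of linear policies.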
