ML LG | Jun 17, 2022

Generalised Policy Improvement with Geometric Policy Composition

Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Rémi Munos, André Barreto

arXiv:2206.08736v1 | 14.514 citations | h-index: 88

Originality: Incremental advance

AI Analysis

This work offers RL practitioners a hybrid approach to policy improvement. The advance appears incremental, since it builds on existing concepts such as geometric horizon models (GHMs) and generalised policy improvement (GPI).

The paper tackles policy improvement in reinforcement learning with a method that interpolates between value-based and model-based RL. It uses geometric horizon models to evaluate non-Markov policies composed from a set of base Markov policies, then applies generalised policy improvement to these compositions to obtain a policy that outperforms its precursors. The method is demonstrated empirically on a challenging deep RL continuous control task.

We introduce a method for policy improvement that interpolates between the greedy approach of value-based reinforcement learning (RL) and the full planning approach typical of model-based RL. The new method builds on the concept of a geometric horizon model (GHM, also known as a gamma-model), which models the discounted state-visitation distribution of a given policy. We show that we can evaluate any non-Markov policy that switches between a set of base Markov policies with fixed probability by a careful composition of the base policy GHMs, without any additional learning. We can then apply generalised policy improvement (GPI) to collections of such non-Markov policies to obtain a new Markov policy that will in general outperform its precursors. We provide a thorough theoretical analysis of this approach, develop applications to transfer and standard RL, and empirically demonstrate its effectiveness over standard GPI on a challenging deep RL continuous control task. We also provide an analysis of GHM training methods, proving a novel convergence result regarding previously proposed methods and showing how to train these models stably in deep RL settings.
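The pipeline in the abstract can be illustrated with a minimal tabular sketch in Python with NumPy. This is not the authors' implementation: the paper learns GHMs in deep RL settings, whereas here a GHM is computed in closed form from a known transition tensor. The following are assumptions made for illustration and do not come from the source: rewards depend only on the arriving state; the switching policy follows pi1 after the first action and, at every later step, switches permanently to pi2 with probability alpha; and all names (ghm, q_from_ghm, q_switch, gpi_policy) are hypothetical.

```python
import numpy as np

def ghm(P, pi, discount):
    """Tabular GHM (assumed form): M[(s,a), s'] =
    (1 - d) * sum_{t>=0} d^t * Pr(S_{t+1} = s' | S_0 = s, A_0 = a, pi)."""
    S, A, _ = P.shape
    P_sa = P.reshape(S * A, S)  # (s, a) -> s'
    # (s, a) -> (s', a') transitions when actions after the first follow pi
    P_pi = (P_sa[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
    return (1 - discount) * np.linalg.solve(np.eye(S * A) - discount * P_pi, P_sa)

def q_from_ghm(M, r, discount):
    """Q(s, a) = E_{s' ~ M}[r(s')] / (1 - d), for rewards on arriving states."""
    return M @ r / (1 - discount)

def q_switch(P, pi1, pi2, r, gamma, alpha):
    """Q-values of the switching policy, obtained only from the two GHMs:
    pi1's GHM at the shorter discount beta = gamma * (1 - alpha),
    and pi2's GHM at the full discount gamma."""
    S, A, _ = P.shape
    beta = gamma * (1 - alpha)
    q2 = q_from_ghm(ghm(P, pi2, gamma), r, gamma).reshape(S, A)
    v2 = (pi2 * q2).sum(axis=1)  # value of continuing with pi2 from each state
    return ghm(P, pi1, beta) @ (r + gamma * alpha * v2) / (1 - beta)

def gpi_policy(q_list, S, A):
    """GPI: act greedily with respect to the pointwise maximum of the Q-values."""
    q_max = np.max(np.stack(q_list), axis=0).reshape(S, A)
    greedy = np.zeros((S, A))
    greedy[np.arange(S), q_max.argmax(axis=1)] = 1.0
    return greedy, q_max

# Small random MDP (illustrative data only)
rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.3
P = rng.random((S, A, S))
P /= P.sum(axis=-1, keepdims=True)
r = rng.random(S)
pi1 = rng.random((S, A))
pi1 /= pi1.sum(axis=1, keepdims=True)
pi2 = rng.random((S, A))
pi2 /= pi2.sum(axis=1, keepdims=True)

q1 = q_from_ghm(ghm(P, pi1, gamma), r, gamma)
q2 = q_from_ghm(ghm(P, pi2, gamma), r, gamma)
q12 = q_switch(P, pi1, pi2, r, gamma, alpha)  # pi1, then switch to pi2
q21 = q_switch(P, pi2, pi1, r, gamma, alpha)  # pi2, then switch to pi1

# GPI over the base policies and their switching compositions
pi_new, q_max = gpi_policy([q1, q2, q12, q21], S, A)
q_new = q_from_ghm(ghm(P, pi_new, gamma), r, gamma)
```

In this sketch, alpha = 0 recovers the Q-values of pi1 and alpha = 1 recovers those of pi2, so alpha interpolates between the two base policies. Evaluating the switching policy needs no additional learning beyond the two GHMs, and the GPI policy pi_new is Markov even though the switching policies it improves on are not.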
