LG · AI · ML · Jun 19, 2020

An operator view of policy gradient methods

arXiv:2006.11266v3 · 28 citations
AI Analysis

This work provides a theoretical framework for reinforcement learning researchers to better analyze and connect existing algorithms, though it is incremental in nature.

The paper unifies policy gradient methods in reinforcement learning by framing them as the repeated application of a policy improvement operator and a projection operator. This view yields a new global lower bound on the expected return and connects policy-based and value-based methods.

We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. We use this framework to introduce operator-based versions of traditional policy gradient methods such as REINFORCE and PPO, which leads to a better understanding of their original counterparts. We also use the understanding we develop of the role of $\mathcal{I}$ and $\mathcal{P}$ to propose a new global lower bound of the expected return. This new perspective allows us to further bridge the gap between policy-based and value-based methods, showing how REINFORCE and the Bellman optimality operator, for example, can be seen as two sides of the same coin.
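
The two-operator loop described in the abstract, $\pi_{t+1} = \mathcal{P}(\mathcal{I}\pi_t)$, can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a single-state bandit with strictly positive rewards, a hypothetical improvement operator that reweights each action by its reward, and a hypothetical projection that runs gradient descent on $\mathrm{KL}(\mathcal{I}\pi \,\|\, \pi_\theta)$ over softmax policies. The names `improve`, `project`, and `expected_return` and all numeric settings are illustrative choices.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def expected_return(pi, rewards):
    """Expected reward of policy pi in a single-state bandit."""
    return sum(p * r for p, r in zip(pi, rewards))


def improve(pi, rewards):
    """Assumed improvement operator I: reweight each action by its reward.

    Requires strictly positive rewards so the result is a distribution.
    """
    weighted = [p * r for p, r in zip(pi, rewards)]
    total = sum(weighted)  # equals expected_return(pi, rewards)
    return [w / total for w in weighted]


def project(target, logits, steps=200, lr=1.0):
    """Assumed projection P: minimise KL(target || softmax(logits)).

    Uses plain gradient descent; the gradient w.r.t. the logits is
    softmax(logits) - target. With finitely many steps this is only
    an approximate projection.
    """
    logits = list(logits)
    for _ in range(steps):
        pi = softmax(logits)
        logits = [l - lr * (p - t) for l, p, t in zip(logits, pi, target)]
    return logits


rewards = [1.0, 2.0, 3.0]   # illustrative, strictly positive
logits = [0.0, 0.0, 0.0]    # start from the uniform policy
returns = []

for _ in range(50):
    pi = softmax(logits)
    returns.append(expected_return(pi, rewards))
    target = improve(pi, rewards)      # apply I
    logits = project(target, logits)   # apply P (approximately)

final_pi = softmax(logits)
print([round(p, 4) for p in final_pi])
print(round(expected_return(final_pi, rewards), 4))
```

Under these assumptions every softmax policy over three actions is realizable, so the projection only introduces optimization error. Repeating the two steps moves probability mass toward the highest-reward action.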

