LG · Dec 19, 2017

On Wasserstein Reinforcement Learning and the Fokker-Planck equation

arXiv:1712.07185v1 · 26 citations
Originality: Incremental advance
AI Analysis

This work addresses convergence issues in policy gradient methods for reinforcement learning, offering theoretical insights that are incremental but clarify empirical practices.

The paper limits policy changes in reinforcement learning with the Wasserstein distance instead of the Kullback-Leibler divergence. It shows that, in the small-step limit, policy dynamics follow the Fokker-Planck equation, which helps explain convergence and justifies practices such as Gaussian priors and gradient noise.

Policy gradient methods often achieve better performance when the change in policy is limited to a small Kullback-Leibler divergence. We derive policy gradients where the change in policy is limited to a small Wasserstein distance (or trust region). This is done in the discrete and continuous multi-armed bandit settings with entropy regularisation. We show that in the small-step limit with respect to the Wasserstein distance $W_2$, policy dynamics are governed by the Fokker-Planck (heat) equation, following the Jordan-Kinderlehrer-Otto result. This means that policies undergo diffusion and advection, concentrating near actions with high reward. This helps elucidate the nature of convergence in the probability matching setup, and provides justification for empirical practices such as Gaussian policy priors and additive gradient noise.
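The diffusion-and-advection picture can be illustrated with a small particle simulation. This is a minimal sketch and not the paper's code. It assumes a one-dimensional continuous bandit, a hypothetical quadratic reward $r(a) = -(a-\mu)^2/2$, and an inverse temperature $\beta$ standing in for the strength of entropy regularisation (notation chosen here, not taken from the paper). Under these assumptions the Fokker-Planck equation $\partial_t \pi = -\partial_a(\pi\, r'(a)) + \beta^{-1}\partial_a^2 \pi$ describes the density of particles following the Langevin dynamics $da = r'(a)\,dt + \sqrt{2/\beta}\,dW$, and its stationary solution is $\pi(a) \propto \exp(\beta r(a))$, here a Gaussian with mean $\mu$ and variance $1/\beta$.

```python
import math
import random


def simulate_policy_particles(reward_grad, beta, n_particles=2000,
                              n_steps=500, dt=0.02, seed=0):
    """Euler-Maruyama simulation of da = r'(a) dt + sqrt(2/beta) dW.

    Each particle is one sampled action; the empirical distribution of the
    particles approximates the policy density evolving under Fokker-Planck.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    actions = [0.0] * n_particles            # every particle starts at a = 0
    noise_scale = math.sqrt(2.0 * dt / beta)  # diffusion term per step
    for _ in range(n_steps):
        actions = [
            a + reward_grad(a) * dt + noise_scale * rng.gauss(0.0, 1.0)
            for a in actions
        ]
    return actions


# Hypothetical reward r(a) = -(a - MU)^2 / 2, so r'(a) = -(a - MU).
MU = 1.0
BETA = 4.0


def quadratic_reward_grad(a):
    return -(a - MU)


samples = simulate_policy_particles(quadratic_reward_grad, BETA)
mean = sum(samples) / len(samples)
var = sum((a - mean) ** 2 for a in samples) / len(samples)
print(f"empirical mean {mean:.3f} (target {MU}), "
      f"empirical variance {var:.3f} (target {1.0 / BETA})")
```

The drift term moves particles towards the high-reward action (advection), while the noise term spreads them out (diffusion). With a quadratic reward the two balance at a Gaussian, which is one way to read the abstract's remark about Gaussian policy priors and additive gradient noise.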

