LG, Dec 22, 2017

A short variational proof of equivalence between policy gradients and soft Q learning

arXiv:1712.08650v1 (8 citations)
Originality: Synthesis-oriented
AI Analysis

This work offers a theoretical insight for reinforcement learning researchers, but it is incremental as it builds on known equivalences and duality results.

The paper provides a short variational proof of the equivalence between policy gradients and soft Q-learning, leveraging convex duality of Shannon entropy and the softmax function, and introduces a new policy inequality relative to soft Q-learning.

Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven equivalent when a softmax relaxation is used on one side and an entropic regularization on the other. We relate this result to the well-known convex duality between Shannon entropy and the softmax function, a result also known as the Donsker-Varadhan formula. This provides a short proof of the equivalence. We then interpret this duality further, and use ideas from convex analysis to prove a new policy inequality relative to soft Q-learning.
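The duality mentioned in the abstract can be checked numerically on a single state. The sketch below is not taken from the paper: the function names, the temperature parameter `tau`, and the sample Q-values are all illustrative assumptions. It checks the standard identity tau * log(sum_a exp(q_a / tau)) = max over distributions p of <p, q> + tau * H(p), where H is the Shannon entropy and the maximum is attained at p = softmax(q / tau).

```python
import math
import random


def logsumexp(q, tau=1.0):
    """Soft maximum tau * log(sum(exp(q_a / tau))), computed stably."""
    m = max(q)
    return m + tau * math.log(sum(math.exp((x - m) / tau) for x in q))


def softmax(q, tau=1.0):
    """Softmax (Boltzmann) distribution over actions at temperature tau."""
    m = max(q)
    weights = [math.exp((x - m) / tau) for x in q]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    """Shannon entropy in nats, with the convention 0 * log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def regularized_value(p, q, tau=1.0):
    """Entropy-regularized objective <p, q> + tau * H(p)."""
    return sum(pi * qi for pi, qi in zip(p, q)) + tau * entropy(p)


# Hypothetical Q-values for one state with three actions.
q = [1.0, 2.0, 0.5]
tau = 0.7

soft_value = logsumexp(q, tau)
at_softmax = regularized_value(softmax(q, tau), q, tau)

# Any other distribution should score no higher than the soft value.
random.seed(0)
best_random = -math.inf
for _ in range(1000):
    w = [random.random() for _ in q]
    p = [x / sum(w) for x in w]
    best_random = max(best_random, regularized_value(p, q, tau))

print("soft value (log-sum-exp):   ", soft_value)
print("objective at softmax policy:", at_softmax)
print("best of 1000 random policies:", best_random)
```

The first two printed numbers should agree up to floating-point error, and the third should not exceed them. This is only the single-state, finite-action form of the duality; the paper's full argument over trajectories is not reproduced here.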

