LG · ML · Feb 7, 2019

Compatible Natural Gradient Policy Search

arXiv:1902.02823v1 · 26 citations
Originality: Highly original
AI Analysis

This paper addresses premature convergence in policy search for reinforcement learning and offers a novel method that it evaluates on both continuous control and discrete partially observable tasks.

The paper tackles premature convergence in natural gradient policy search by introducing COPOS, a method that bounds entropy loss, and reports state-of-the-art results in continuous control and discrete partially observable tasks.

Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use the KL-divergence to bound the trust region, resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule, leading to premature convergence. To control entropy reduction, we introduce a new policy search method called compatible policy search (COPOS), which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
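
The ideas in the abstract can be illustrated on a toy problem. The sketch below is not the COPOS algorithm from the paper; it is a minimal illustration assuming a single-state problem (a bandit) with a softmax policy whose logits serve as the natural parameters. All function names, the reward vector, and the values of `eps` and `beta` are hypothetical. It shows a KL-scaled natural gradient step, and it uses a simple step-halving rule as a stand-in for an entropy-loss bound, which is an assumption of this sketch rather than the paper's method.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def entropy(p):
    return -np.sum(p * np.log(p))

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

def natural_gradient(theta, rewards):
    """Vanilla and natural gradient of expected reward w.r.t. the logits."""
    p = softmax(theta)
    grad = p * (rewards - p @ rewards)        # vanilla policy gradient
    fisher = np.diag(p) - np.outer(p, p)      # Fisher matrix of a softmax policy
    # The Fisher matrix is singular along the all-ones direction (shifting
    # every logit equally leaves the policy unchanged), so use a pseudo-inverse.
    nat_grad = np.linalg.pinv(fisher, rcond=1e-10) @ grad
    return grad, nat_grad

def kl_scaled_step(theta, rewards, eps):
    """Natural gradient step sized so the quadratic KL approximation equals eps."""
    grad, nat_grad = natural_gradient(theta, rewards)
    scale = np.sqrt(2.0 * eps / (grad @ nat_grad))
    return scale * nat_grad

def entropy_bounded_update(theta, rewards, eps, beta, max_halvings=50):
    """Halve the KL-scaled step until the entropy loss is at most beta.

    Step halving is an assumption of this sketch, not the paper's procedure.
    """
    step = kl_scaled_step(theta, rewards, eps)
    h_old = entropy(softmax(theta))
    for _ in range(max_halvings):
        candidate = theta + step
        if h_old - entropy(softmax(candidate)) <= beta:
            return candidate
        step = 0.5 * step
    return theta  # no acceptable step found; keep the old policy

theta0 = np.zeros(4)                          # uniform initial policy
rewards = np.array([1.0, 0.0, 0.0, 0.0])      # hypothetical per-action rewards
eps, beta = 0.01, 0.005                       # hypothetical KL and entropy bounds

unbounded = theta0 + kl_scaled_step(theta0, rewards, eps)
bounded = entropy_bounded_update(theta0, rewards, eps, beta)

p_old = softmax(theta0)
print("KL of plain step:        ", kl(softmax(unbounded), p_old))
print("entropy loss, plain step:", entropy(p_old) - entropy(softmax(unbounded)))
print("entropy loss, bounded:   ", entropy(p_old) - entropy(softmax(bounded)))
```

In this toy setting the natural gradient direction equals the mean-centred reward vector, which is a simple instance of the link between natural gradients and compatible value function approximation that the abstract refers to. The plain KL-scaled step loses more entropy than `beta` allows, while the bounded update stays within it.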

