LG · Dec 20, 2022

Policy Gradient in Robust MDPs with Global Convergence Guarantee

arXiv:2212.10439v2 · 44 citations · h-index: 22
Originality: Highly original
AI Analysis

This work addresses a problem faced by reinforcement learning practitioners: computing policies that remain reliable under model errors. It proposes a new method rather than an incremental improvement.

The paper adapts policy gradient methods to robust Markov decision processes (RMDPs) by proposing the Double-Loop Robust Policy Gradient (DRPG), which is guaranteed to converge to a globally optimal policy in tabular RMDPs. The numerical results are consistent with this guarantee.

Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors. Many successful reinforcement learning algorithms build on variations of policy-gradient methods, but adapting these methods to RMDPs has been challenging. As a result, the applicability of RMDPs to large, practical domains remains limited. This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs. In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy in tabular RMDPs. We introduce a novel parametric transition kernel and solve the inner loop robust policy via a gradient-based method. Finally, our numerical results demonstrate the utility of our new algorithm and confirm its global convergence properties.
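The double-loop idea in the abstract can be illustrated with a small sketch. This is not the authors' DRPG implementation: it assumes reward maximization, a softmax policy, and a hypothetical uncertainty set in which each state-action pair's transition row is a mixture of a few candidate kernels (the names `drpg_sketch`, `evaluate`, `kernels`, and all step sizes are made up for illustration). The inner loop runs gradient descent on the adversary's mixture weights for a fixed number of steps, which simplifies the shrinking inner-loop tolerance the abstract describes, and the outer loop takes one policy gradient ascent step against the resulting kernel.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def evaluate(pi, P, r, mu, gamma):
    """Exact evaluation of policy pi (S, A) under kernel P (S, A, S)."""
    n = len(mu)
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward
    M = np.eye(n) - gamma * P_pi
    V = np.linalg.solve(M, r_pi)            # state values
    D = np.linalg.solve(M.T, mu)            # unnormalized discounted occupancy
    Q = r + gamma * P @ V                   # action values
    return mu @ V, V, Q, D

def drpg_sketch(kernels, r, mu, gamma=0.9, outer_steps=200,
                inner_steps=20, lr_pi=0.5, lr_p=0.5):
    """Illustrative double loop; kernels has shape (K, S, A, S)."""
    K, S, A, _ = kernels.shape
    theta = np.zeros((S, A))     # policy logits
    z = np.zeros((S, A, K))      # adversary logits (assumed mixture model)
    for _ in range(outer_steps):
        pi = softmax(theta)
        # Inner loop: the adversary lowers the return by gradient descent.
        for _ in range(inner_steps):
            w = softmax(z)
            P = np.einsum("sak,ksat->sat", w, kernels)
            _, V, _, D = evaluate(pi, P, r, mu, gamma)
            # dJ/dw[s,a,k] = gamma * D(s) * pi(a|s) * <kernels[k,s,a,:], V>
            g = (gamma * D[:, None, None] * pi[:, :, None]
                 * np.einsum("ksat,t->sak", kernels, V))
            # Chain rule through the softmax over candidate kernels.
            z -= lr_p * w * (g - (w * g).sum(axis=-1, keepdims=True))
        # Outer step: softmax policy gradient ascent against that kernel.
        P = np.einsum("sak,ksat->sat", softmax(z), kernels)
        _, V, Q, D = evaluate(pi, P, r, mu, gamma)
        theta += lr_pi * D[:, None] * pi * (Q - V[:, None])
    return softmax(theta), softmax(z)

# Toy problem: state 1 is an absorbing zero-reward trap. In state 0 the
# "safe" action (0) pays 0.5 and stays; the "risky" action (1) pays 1.0
# and either stays (kernel 0) or falls into the trap (kernel 1).
gamma = 0.9
mu = np.array([1.0, 0.0])
r = np.array([[0.5, 1.0], [0.0, 0.0]])
stay = np.array([[[1.0, 0.0], [1.0, 0.0]],
                 [[0.0, 1.0], [0.0, 1.0]]])
trap = stay.copy()
trap[0, 1] = [0.0, 1.0]
kernels = np.stack([stay, trap])

pi, w = drpg_sketch(kernels, r, mu, gamma)
print("policy in state 0:", pi[0])
print("adversary weights for (state 0, risky):", w[0, 1])
```

In this toy problem the nominal model favors the risky action, while the worst case favors the safe one, so the learned policy should put most of its probability on the safe action. The mixture-of-kernels set is only one convenient choice for a sketch; other uncertainty sets would need a different inner-loop parameterization or a projection step.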

Code Implementations: 1 repo
