LGJan 31, 2024

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

arXiv:2401.17780v35 citationsh-index: 15
Originality Highly original
AI Analysis

This work addresses a key theoretical gap in reinforcement learning for constrained decision-making, providing the first Uniform-PAC algorithm for online CMDPs, which is incremental as it builds on existing primal-dual methods but with improved guarantees.

The paper tackles the problem of online constrained Markov decision processes (CMDPs) by introducing a policy gradient primal-dual algorithm that ensures convergence to optimal policies, sublinear regret, and polynomial sample complexity, with empirical validation showing it outperforms baselines by avoiding oscillations and constraint violations.

We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes