LGJan 31, 2024

A Policy Gradient Primal-Dual Algorithm for Constrained MDPs with Uniform PAC Guarantees

Toshinori Kitamura, Tadashi Kozuno, Masahiro Kato, Yuki Ichihara, Soichiro Nishimori, Akiyoshi Sannai, Sho Sonoda, Wataru Kumagai, Yutaka Matsuo

arXiv:2401.17780v312.55 citationsh-index: 15Has Code

Originality Highly original

AI Analysis

This work addresses a key theoretical gap in reinforcement learning for constrained decision-making, providing the first Uniform-PAC algorithm for online CMDPs, which is incremental as it builds on existing primal-dual methods but with improved guarantees.

The paper tackles the problem of online constrained Markov decision processes (CMDPs) by introducing a policy gradient primal-dual algorithm that ensures convergence to optimal policies, sublinear regret, and polynomial sample complexity, with empirical validation showing it outperforms baselines by avoiding oscillations and constraint violations.

We study a primal-dual (PD) reinforcement learning (RL) algorithm for online constrained Markov decision processes (CMDPs). Despite its widespread practical use, the existing theoretical literature on PD-RL algorithms for this problem only provides sublinear regret guarantees and fails to ensure convergence to optimal policies. In this paper, we introduce a novel policy gradient PD algorithm with uniform probably approximate correctness (Uniform-PAC) guarantees, simultaneously ensuring convergence to optimal policies, sublinear regret, and polynomial sample complexity for any target accuracy. Notably, this represents the first Uniform-PAC algorithm for the online CMDP problem. In addition to the theoretical guarantees, we empirically demonstrate in a simple CMDP that our algorithm converges to optimal policies, while baseline algorithms exhibit oscillatory performance and constraint violation.

View on arXiv PDF Code

Similar